Using IFilters for indexing binary files

Jonathan Cutting

Thursday 20 January 2005 1:08:34 pm

Perhaps someone has already covered this, but I can find no mention of it in the documentation. For those of you working on Windows, a convenient way to index binary files is the IFilter mechanism used by Microsoft Indexing Service.

The Microsoft Platform SDK has an executable in the bin directory called FiltDump.exe. It takes the name of a file as an argument and uses the registered IFilter, if any, to print the file's text content to stdout.

For example, the command

filtdump -b test.doc

 

will dump the text content of test.doc to stdout using the IFilter registered for .doc files. The -b switch suppresses error messages and other extraneous information. Note that Indexing Service must be installed, but it need not be running, for this to work.
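To illustrate how a calling program could capture that output, here is a minimal Python sketch. The function name `extract_text` and the use of Python are assumptions made purely for illustration and are not part of eZ Publish; the only thing taken from the description above is the `filtdump -b <file>` command line.

```python
import subprocess

def extract_text(path, tool=("filtdump", "-b")):
    """Run the extraction tool on `path` and return what it prints to stdout."""
    # Passing the arguments as a list avoids quoting problems with spaces in paths.
    result = subprocess.run([*tool, path], capture_output=True, check=False)
    # The output encoding is an assumption here, so undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

If the tool cannot be found on the search path, `subprocess.run` raises `FileNotFoundError`, which is itself a useful signal that the executable is not where the calling process expects it.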

IFilters for HTML, Word, Excel, Visio, PowerPoint, and plain text are available from Microsoft. An IFilter for PDF is available from Adobe. Others, including ones for StarOffice/OpenOffice and DWG, are available commercially.

Now, I've tried to implement this in eZ Publish 3.5.0 (Windows installer version), but without success. I've overridden binaryfile.ini, cleared all caches, rebuilt the search index manually with the --clean option, and marked binary file attributes as searchable in the classes of interest. Still no luck.

My binaryfile.ini overrides:

[HandlerSettings]
MetaDataExtractor[application/pdf]=IFilter
MetaDataExtractor[application/msword]=IFilter

[IFilterHandlerSettings]
TextExtractionTool=filtdump -b

I've tried placing filtdump.exe in a number of different locations, including the eZ Publish root and a directory on the system search path. I have no evidence that it's being executed at all. I've also tried making it run a batch script:

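@ECHO OFF
REM Echo the argument to confirm the script is being invoked at all.
ECHO %1
REM %~1 strips any surrounding quotes so the path can be re-quoted safely.
filtdump -b "%~1"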

 

Still no luck.
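One check that might narrow this down is whether the executable can be resolved at all from the environment of the calling process, since a web server process may not see the same search path as an interactive command prompt (that is an assumption about the setup, not something confirmed here). A minimal Python sketch, with the helper name `locate_tool` being hypothetical:

```python
import os
import shutil

def locate_tool(name="filtdump", search_path=None):
    """Return the full path of `name` as resolved via the search path, or None."""
    # With search_path=None, shutil.which uses the PATH of the current process,
    # which may differ from the PATH of an interactive command prompt.
    return shutil.which(name, path=search_path)

if __name__ == "__main__":
    found = locate_tool()
    print("filtdump resolved to:", found if found else "not found on PATH")
    print("PATH seen by this process:", os.environ.get("PATH", ""))
```

Running the same kind of check from inside the web server's own scripting environment, rather than from a console, would show whether the two environments agree.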

Can someone please help me understand what needs to be done to make this work? Where should filtdump.exe be located? Do Apache or PHP need to be configured any differently? Again, I'm using the basic Windows installer for 3.5.0 - nothing special.

Jonathan
