Functional binaryfile indexing with IFilters on Windows


Jonathan Cutting

Tuesday 01 February 2005 4:25:47 pm

All,

I posted earlier about using Microsoft's IFilters to index binary files on Windows installations of eZ Publish. After some rummaging about I got it to work. I've tried PPT, DOC, and PDF files so far, and it works well.

The idea is that instead of using pstotext or pdftotext for PDF files and wvware for Word files, we can use IFilters provided by Microsoft (doc, xls, ppt, vsd, etc.), Adobe (PDF), and many others (SXW, DWG, etc.). Microsoft has a utility called filtdump, a command-line tool that uses IFilters to dump the text content of a file to stdout.
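As a rough illustration of what the extraction step amounts to, here is a minimal Python sketch that runs the tool on one file and captures whatever it prints to stdout. The function name `extract_text` and its `tool` parameter are made up for this example. The default command is the `filtdump.exe -b` invocation from the notes below, and the sketch assumes filtdump.exe is on the system path.

```python
import subprocess


def extract_text(path, tool=("filtdump.exe", "-b")):
    """Run a text-extraction command on `path` and return its stdout.

    `tool` is the command plus its options. The default is the filtdump
    invocation from this post and assumes filtdump.exe is on the path.
    """
    result = subprocess.run(
        [*tool, path],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        errors="replace",     # do not fail on bytes that cannot be decoded
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling `extract_text("report.doc")` on a machine with the matching IFilter installed should return the dumped text. Passing a different `tool` makes it easy to compare against pdftotext or wvware output.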

This hack does require a new plugin for the ezbinaryfile type.

Here are my rough notes:

Install Indexing Service - the service doesn't actually have to be running, but you'll need it installed.

Download the Microsoft Platform SDK. Copy filtdump.exe from the bin directory to a conveniently located folder on the system path.

Override binaryfile.ini as follows:

[HandlerSettings]
MetaDataExtractor[application/msword]=ifilter
MetaDataExtractor[application/vnd.ms-excel]=ifilter
MetaDataExtractor[application/pdf]=ifilter
MetaDataExtractor[application/vnd.ms-powerpoint]=ifilter
MetaDataExtractor[application/vnd.visio]=ifilter

[IFilterHandlerSettings]
TextExtractionTool=filtdump.exe -b
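Typos in the override are easy to miss, so a quick sanity check can help. The sketch below is a small stand-alone Python parser for the subset of INI syntax used above (group headers, `Key=value`, and `Key[index]=value`). It is not eZ Publish's own INI reader, and `parse_ez_ini` is a name invented for this example.

```python
def parse_ez_ini(text):
    """Parse [Group], Key=value and Key[index]=value lines into nested dicts.

    Only the subset of syntax used in the override above is handled.
    """
    settings = {}
    group = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            group = settings.setdefault(line[1:-1], {})
            continue
        if group is None or "=" not in line:
            continue  # ignore anything this sketch does not understand
        key, value = (part.strip() for part in line.split("=", 1))
        if key.endswith("]") and "[" in key:
            # Array entry such as MetaDataExtractor[application/pdf]=ifilter
            name, index = key[:-1].split("[", 1)
            group.setdefault(name, {})[index] = value
        else:
            group[key] = value
    return settings


OVERRIDE = """\
[HandlerSettings]
MetaDataExtractor[application/msword]=ifilter
MetaDataExtractor[application/vnd.ms-excel]=ifilter
MetaDataExtractor[application/pdf]=ifilter
MetaDataExtractor[application/vnd.ms-powerpoint]=ifilter
MetaDataExtractor[application/vnd.visio]=ifilter

[IFilterHandlerSettings]
TextExtractionTool=filtdump.exe -b
"""

settings = parse_ez_ini(OVERRIDE)
handlers = settings["HandlerSettings"]["MetaDataExtractor"]
tool = settings["IFilterHandlerSettings"]["TextExtractionTool"]

# Show which command each MIME type ends up using.
for mime_type, handler in handlers.items():
    print(f"{mime_type} -> {handler} -> {tool}")
```

A missing or misspelled group shows up immediately as a `KeyError`, which is quicker than waiting for an indexing run to come back empty.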

Find the directory ezpublish/kernel/classes/datatypes/ezbinaryfile/plugins. Copy ezwordparser.php and rename the copy ezifilterparser.php. Replace all instances of "word" with "ifilter".
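The search-and-replace can be done in any editor. For anyone who prefers to script it, here is a minimal Python sketch of a case-aware replace. The capitalised spellings are an assumption: the post only says to replace "word" with "ifilter", and "IFilter" is inferred from the [IFilterHandlerSettings] group name above. The sample string is made up and is not the real content of ezwordparser.php.

```python
import re

# Assumed spellings for each casing of "word"; adjust if the real file differs.
REPLACEMENTS = {"word": "ifilter", "Word": "IFilter", "WORD": "IFILTER"}


def rename_handler(source):
    """Replace every occurrence of 'word', keeping the casings listed above."""
    return re.sub(
        "word",
        # Unlisted casings (e.g. "wOrd") fall back to plain "ifilter".
        lambda match: REPLACEMENTS.get(match.group(0), "ifilter"),
        source,
        flags=re.IGNORECASE,
    )


# Hypothetical sample text, only to show the effect of the replace.
sample = "class eZWordParser reads WordHandlerSettings"
print(rename_handler(sample))
```

To apply it, read ezwordparser.php, pass its text through `rename_handler`, and write the result to ezifilterparser.php. A blind replace also touches words such as "keyword" or "password", so it is worth reading through the result before using it.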

Change the File class so that the file attribute is searchable.

Increase the maximum query size on MySQL:

[mysqld]
set-variable = max_allowed_packet=16M

Restart MySQL.

Consider turning on delayed indexing and indexing by cron job - it can take a long time to index files when uploading and it gets annoying.

Consider enabling wildcard searches.

Set the MIME type for Excel in eZ Publish. That's described somewhere on the ez.no site.

Clear all caches.

To index manually, copy php.exe from the cli directory of the current PHP distribution (zip), rename it phpcli.exe, and place it in the PHP directory used by eZ Publish. Then, from the eZ Publish directory, run the script:

 ..\php\phpcli -C update\common\scripts\updatesearchindex.php --clean

Kristian Hole

Wednesday 02 February 2005 7:46:19 am

Cool :)

Will you add this under documentation/customization/tips_tricks ?

Kristian

http://ez.no/ez_publish/documenta...tricks/show_which_templates_are_used
http://ez.no/doc/ez_publish/techn...te_operators/miscellaneous/attribute

Jonathan Cutting

Wednesday 02 February 2005 1:09:52 pm

Done.

Kristian Hole

Thursday 03 February 2005 2:42:01 am

Great. Thank you!

Kristian


