Functional binaryfile indexing with IFilters on Windows

Jonathan Cutting

Tuesday 01 February 2005 4:25:47 pm

All,

I posted earlier about using Microsoft's IFilters to index binary files on Windows installations of eZ Publish. After some rummaging about I got it to work - I've tried PPT, DOC, and PDF so far and it works well.

The idea is that instead of using pstotext or pdftotext for PDF files and wvware for Word files, we can use IFilters provided by Microsoft (doc, xls, ppt, vsd, etc.), Adobe (PDF), and many others (SXW, DWG, etc.). Microsoft provides filtdump, a command-line utility that uses IFilters to dump the text content of a file to stdout.
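
To make the "dump to stdout" idea concrete, here is a minimal sketch of calling such a tool and capturing its output. It is written in Python purely for illustration (the actual eZ Publish plugin is PHP). The function name extract_text is hypothetical, the -b flag is simply the one used in the override further down, and decoding the output as UTF-8 is an assumption, since the real encoding may differ on your system.

```python
import subprocess


def extract_text(path, tool=("filtdump.exe", "-b"), timeout=120):
    """Run a text-dumping tool on `path` and return what it writes to stdout.

    Assumes the tool is on the system path and takes the file name as its
    last argument.
    """
    result = subprocess.run(
        [*tool, path],
        capture_output=True,  # collect stdout/stderr instead of printing them
        timeout=timeout,      # large documents can take a while to filter
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: treat the output as UTF-8 and replace undecodable bytes
    # rather than failing on them.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text(r"C:\docs\report.doc") would then return the extracted text as one string, provided filtdump.exe and a matching IFilter are installed.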

This hack does require a new plugin for the ezbinaryfile type.

Here are my rough notes:

Install Indexing Service - the service doesn't actually have to be running, but you'll need it installed.

Download the Microsoft Platform SDK. Copy filtdump.exe from the bin directory to a conveniently located folder on the system path.

Override binaryfile.ini as follows:

[HandlerSettings]
MetaDataExtractor[application/msword]=ifilter
MetaDataExtractor[application/vnd.ms-excel]=ifilter
MetaDataExtractor[application/pdf]=ifilter
MetaDataExtractor[application/vnd.ms-powerpoint]=ifilter
MetaDataExtractor[application/vnd.visio]=ifilter

[IFilterHandlerSettings]
TextExtractionTool=filtdump.exe -b

Find the directory ezpublish/kernel/classes/datatypes/ezbinaryfile/plugins. Copy ezwordparser.php and rename the copy ezifilterparser.php. In the copy, replace all instances of "word" with "ifilter".
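
The copy-and-replace step can also be scripted. The following is a rough sketch in Python, not part of the original recipe. The helper names are hypothetical, and mapping a capitalised "Word" to "IFilter" is an assumption inferred from the [IFilterHandlerSettings] group name in the override above. A blanket replacement can also hit unrelated words, so review the resulting file by hand.

```python
import re
from pathlib import Path


def word_to_ifilter(source: str) -> str:
    """Replace every 'word' with 'ifilter', roughly preserving capitalisation."""
    def swap(match):
        found = match.group(0)
        if found.isupper():      # WORD -> IFILTER
            return "IFILTER"
        if found[0].isupper():   # Word -> IFilter (assumed casing)
            return "IFilter"
        return "ifilter"         # word -> ifilter
    return re.sub(r"word", swap, source, flags=re.IGNORECASE)


def make_ifilter_parser(plugin_dir: str) -> Path:
    """Write ezifilterparser.php next to ezwordparser.php with the names swapped."""
    src = Path(plugin_dir) / "ezwordparser.php"
    dst = Path(plugin_dir) / "ezifilterparser.php"
    # Assumption: the plugin file can be read and written as UTF-8 text.
    dst.write_text(word_to_ifilter(src.read_text(encoding="utf-8")), encoding="utf-8")
    return dst
```

Run make_ifilter_parser against the plugins directory named above. The original ezwordparser.php is left untouched.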

Change the File class so that the file attribute is searchable.

Increase the maximum query size on MySQL:

[mysqld]
set-variable = max_allowed_packet=16M

Restart MySQL.

Consider turning on delayed indexing and indexing by cron job - it can take a long time to index files when uploading and it gets annoying.

Consider enabling wildcard searches.

You also need to set the MIME type for Excel in ezpublish - that's described somewhere on the ez.no site.

Clear all caches.

To index manually, copy php.exe from the cli directory of the current PHP distribution (zip), rename it phpcli.exe, and place it in the PHP directory used by ezpublish. Then, from the ezpublish directory, run the script:

 ..\php\phpcli -C update\common\scripts\updatesearchindex.php --clean

Kristian Hole

Wednesday 02 February 2005 7:46:19 am

Cool :)

Will you add this under documentation/customization/tips_tricks ?

Kristian

http://ez.no/ez_publish/documenta...tricks/show_which_templates_are_used
http://ez.no/doc/ez_publish/techn...te_operators/miscellaneous/attribute

Jonathan Cutting

Wednesday 02 February 2005 1:09:52 pm

Done.

Kristian Hole

Thursday 03 February 2005 2:42:01 am

Great. Thank you!

Kristian
