Functional binaryfile indexing with IFilters on Windows

Jonathan Cutting

Tuesday 01 February 2005 4:25:47 pm

All,

I posted earlier about using Microsoft's IFilters to index binary files on Windows installations of eZ Publish. After some rummaging about I got it to work - I've tried PPT, DOC, and PDF so far and it works well.

The idea is that instead of using pstotext or pdftotext for PDF files and wvware for Word files, we can use IFilters provided by Microsoft (doc, xls, ppt, vsd, etc.), Adobe (PDF), and many others (SXW, DWG, etc.). Microsoft has a utility called filtdump - it's a command line utility that uses IFilters to dump the text content of a file to stdout.
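For a quick check outside eZ Publish, here is a minimal Python sketch of the same idea: run the extraction tool on a file and capture whatever it writes to stdout. The function name extract_text, the timeout, and the lenient UTF-8 decoding are illustrative assumptions; the only filtdump option used is the -b from the binaryfile.ini override in the notes below.

```python
import subprocess
from typing import Sequence


def extract_text(tool_cmd: Sequence[str], file_path: str, timeout: float = 120.0) -> str:
    """Run a text-extraction tool on file_path and return what it writes to stdout."""
    result = subprocess.run(
        [*tool_cmd, file_path],   # the file name is passed as the last argument
        capture_output=True,      # collect stdout and stderr instead of printing them
        timeout=timeout,          # large documents can take a while to filter
        check=True,               # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: the output encoding depends on the filter and the console
    # code page, so decode leniently rather than fail on odd bytes.
    return result.stdout.decode("utf-8", errors="replace")


# Hypothetical usage, assuming filtdump.exe is on the system path:
#   text = extract_text(["filtdump.exe", "-b"], r"C:\docs\report.doc")
```

If this returns readable text for a sample DOC or PDF, the IFilter for that file type is installed and the plugin should be able to index it.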

This hack does require a new plugin for the ezbinaryfile type.

Here are my rough notes:

Install Indexing Service - the service doesn't actually have to be running, but you'll need it installed.

Download the Microsoft Platform SDK. Copy filtdump.exe from the bin directory to a conveniently located folder on the system path.

Override binaryfile.ini as follows:

[HandlerSettings]
MetaDataExtractor[application/msword]=ifilter
MetaDataExtractor[application/vnd.ms-excel]=ifilter
MetaDataExtractor[application/pdf]=ifilter
MetaDataExtractor[application/vnd.ms-powerpoint]=ifilter
MetaDataExtractor[application/vnd.visio]=ifilter

[IFilterHandlerSettings]
TextExtractionTool=filtdump.exe -b

Find the directory ezpublish/kernel/classes/datatypes/ezbinaryfile/plugins. Copy ezwordparser.php and rename the copy ezifilterparser.php. In the copy, replace all instances of "word" with "ifilter".
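The copy-and-rename step can be scripted. The following Python sketch is one way to do it, not the only one. The function name make_ifilter_parser is made up for illustration, and the casing rule is an assumption: capitalised "Word" becomes "IFilter" so that a settings group name would line up with [IFilterHandlerSettings] above, and lowercase "word" becomes "ifilter". Check the resulting file by hand before using it.

```python
from pathlib import Path


def make_ifilter_parser(plugins_dir: str) -> Path:
    """Copy ezwordparser.php to ezifilterparser.php, renaming "word" to "ifilter"."""
    plugins = Path(plugins_dir)
    source = plugins / "ezwordparser.php"
    target = plugins / "ezifilterparser.php"

    # Work on raw bytes so the file's original encoding is left untouched.
    data = source.read_bytes()

    # Assumption: "Word" -> "IFilter" and "word" -> "ifilter".
    # Other casings (for example "WORD") are not handled here.
    data = data.replace(b"Word", b"IFilter").replace(b"word", b"ifilter")

    target.write_bytes(data)
    return target


# Hypothetical usage, run from the directory that contains ezpublish:
#   make_ifilter_parser("ezpublish/kernel/classes/datatypes/ezbinaryfile/plugins")
```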

Change the File class so that the file attribute is searchable.

Increase the maximum query size on MySQL:

[mysqld]
set-variable = max_allowed_packet=16M

Restart MySQL.

Consider turning on delayed indexing and indexing by cron job - it can take a long time to index files when uploading and it gets annoying.

Consider enabling wildcard searches.

You need to set the MIME type for Excel in eZ Publish - that's described somewhere on the ez.no site.

Clear all caches.

To index manually, copy php.exe from the cli directory of the current PHP distribution (zip), rename it phpcli.exe, and place it in the PHP directory used by ezpublish. Then, from the ezpublish directory, run the script:

 ..\php\phpcli -C update\common\scripts\updatesearchindex.php --clean

Kristian Hole

Wednesday 02 February 2005 7:46:19 am

Cool :)

Will you add this under documentation/customization/tips_tricks ?

Kristian

http://ez.no/ez_publish/documenta...tricks/show_which_templates_are_used
http://ez.no/doc/ez_publish/techn...te_operators/miscellaneous/attribute

Jonathan Cutting

Wednesday 02 February 2005 1:09:52 pm

Done.

Kristian Hole

Thursday 03 February 2005 2:42:01 am

Great. Thank you!

Kristian

