PDF indexing

Michael Hall

Wednesday 05 March 2008 4:15:08 am

Does eZ Publish 4's search indexing include indexing PDF files?

Tony Wood

Wednesday 05 March 2008 5:06:47 am

For this I would recommend using eZ Find. It is based on Lucene, so it does a very good job.

Tony Wood : twitter.com/tonywood
Vision with Technology
Experts in eZ Publish consulting & development

Power to the Editor!

Free eZ Training : http://www.VisionWT.com/training
eZ Future Podcast : http://www.VisionWT.com/eZ-Future

Michael Hall

Wednesday 05 March 2008 4:40:35 pm

OK, well I got the old PDF indexing working using pdftotext ... much the same mechanism as good old htdig.

eZ Find seems to be the way to go, even though it means adding a JRE to the server to run it.

We'll be indexing quite large PDF files (historical newspaper collection).

I have yet to run a trial, but I'm wondering how eZ Find works at the database end, and whether very large chunks of data will cause problems, as seemed to happen with the old system.

Also, does anyone know if eZ Find uses some kind of stop list (a list of common function words with little semantic content, such as "the", "a", "and", "it")?
At a glance, the older system doesn't seem to filter these out. These words are among the most frequently used, and filtering them out can significantly reduce the amount of data that needs processing without much impact on the effectiveness of a search.
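
To make the stop list idea concrete, here is a minimal sketch in Python. It is only an illustration of the general technique, not how eZ Find or the older eZ Publish search engine actually works; the word list and the function name index_terms are made up for the example.

```python
# Illustrative stop list only; a real engine would use a longer,
# language-specific list.
STOP_WORDS = {"the", "a", "an", "and", "it", "of", "to", "in"}


def index_terms(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, strip surrounding
    punctuation, and drop any word found in the stop list."""
    terms = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word and word not in stop_words:
            terms.append(word)
    return terms


sample = "The editor and the printer moved it to a new office in town."
kept = index_terms(sample)
print(kept)
print(f"{len(sample.split())} words in, {len(kept)} terms kept")
```

On ordinary prose such as newspaper text, a large share of the words are function words like these, which is where the saving in indexed data would come from.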

Andy Caiger

Tuesday 18 May 2010 1:28:41 am

Although eZ Find has been recommended, it's quite a bit of work to get it running. It does seem that eZ Publish 4 does not index PDFs. Can anyone explain how to get PDF indexing working without installing eZ Find? I'm using eZ Publish 4.2.

EAB - Integrated Internet Success
Offices in England, France & China.
http://www.eab.co.uk http://www.eab-china.com http://www.eab-france.com

Gaetano Giunta

Tuesday 18 May 2010 9:51:58 am

All you need to do is edit the following block in binaryfile.ini:

[PDFHandlerSettings]
TextExtractionTool=pstotext

I'd recommend replacing pstotext with the name of a CLI script you have written. That script can simply write the current date and the parameters it receives to a log file (the first parameter is the path to the PDF file to be converted to plain text).

This will get you started with debugging.
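
A debugging script of the kind described above could look roughly like this in Python. It is a sketch under assumptions: the log file name and location, and the function name log_invocation, are placeholders chosen for the example, and the script only records what it was called with; it does not extract any text.

```python
#!/usr/bin/env python3
"""Hypothetical stand-in for the text extraction tool: logs each call."""
import os
import sys
import tempfile
from datetime import datetime

# Assumed log location; use any path the web server user can write to.
LOG_PATH = os.path.join(tempfile.gettempdir(), "pdf_extract_debug.log")


def log_invocation(args, log_path=LOG_PATH):
    """Append one line holding the current date and the received parameters."""
    line = f"{datetime.now().isoformat()} args={args!r}\n"
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(line)
    return line


if __name__ == "__main__":
    # sys.argv[1:] holds the parameters; the first should be the PDF path.
    log_invocation(sys.argv[1:])
```

To try it, make the script executable, point TextExtractionTool at it, and publish an object with a PDF attached. If a line appears in the log, the handler is being called and the script can be swapped for a real extractor such as pdftotext.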

Principal Consultant International Business
Member of the Community Project Board

Andy Caiger

Tuesday 18 May 2010 6:55:21 pm

Thanks! This is a great idea and helped me solve the problem quickly, together with advice given at http://ez.no/ezpublish/documentation/configuration/optimization/speeding_up_acrobat_pdf_document_indexing_

:-)

