PDF indexing

Author	Message
Michael Hall	Wednesday 05 March 2008 4:15:08 am Does EZ4's search indexing capability include indexing PDF files?
Tony Wood	Wednesday 05 March 2008 5:06:47 am For this I would recommend using eZ Find. It is based on Lucene so does a very good job. Tony Wood : twitter.com/tonywood Vision with Technology Experts in eZ Publish consulting & development Power to the Editor! Free eZ Training : http://www.VisionWT.com/training eZ Future Podcast : http://www.VisionWT.com/eZ-Future
Michael Hall	Wednesday 05 March 2008 4:40:35 pm OK, well I got the old PDF indexing working using pdftotext ... much the same mechanism as good old htdig. eZ Find seems to be the way to go, even though it means adding a JRE to the server to run it. We'll be indexing quite large PDF files (historical newspaper collection). I have yet to run a trial, but I'm wondering how eZ Find works at the database end and whether potentially very large chunks of data will cause problems, as seemed to happen with the old system? Also, does anyone know if eZ Find uses some kind of stop list (a list of common function words with little semantic content, like "the, a, and, it" etc.)? At a glance, the older system doesn't seem to filter these out. These words are among the most commonly used, and filtering them out at indexing time can significantly reduce the amount of data needing processing without having much impact on the effectiveness of a search.
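To illustrate the stop list idea raised above, here is a minimal Python sketch of filtering common function words out of text before it is indexed. It is not how eZ Find or the older eZ Publish search engine works internally; the `STOP_WORDS` set and the `index_terms` function are hypothetical names chosen for this example, and the word list is just the four words quoted in the post.

```python
import re

# Illustrative stop list only: the four words quoted in the post above.
# A real stop list would be much longer and language-specific.
STOP_WORDS = frozenset({"the", "a", "and", "it"})


def index_terms(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in stop_words]


if __name__ == "__main__":
    print(index_terms("The cat and the hat, it sat."))
```

On text extracted from large PDFs, a filter of this kind would shrink the list of terms handed to the indexer, which is the saving the post describes.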
Andy Caiger	Tuesday 18 May 2010 1:28:41 am Although using eZ Find has been recommended, it's quite a bit of work to get it working. It does seem that eZ Publish 4 does not index PDFs. Can anyone explain how to get it working without installing eZ Find? I'm using eZ Publish 4.2. EAB - Integrated Internet Success Offices in England, France & China. http://www.eab.co.uk http://www.eab-china.com http://www.eab-france.com
Gaetano Giunta	Tuesday 18 May 2010 9:51:58 am All you need to do is edit the TextExtractionTool=pstotext setting in the [PDFHandlerSettings] block of binaryfile.ini. I'd recommend replacing pstotext with the name of a CLI script you have written. That script can simply echo the current date and the parameters it receives to a log file (the first parameter is the path to the PDF file to be converted to plain text). This will get you started with debugging. Principal Consultant International Business Member of the Community Project Board
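The debugging script suggested above could look roughly like the following Python sketch. It appends the current date and the received parameters to a log file, then hands the first parameter to pdftotext. Several things here are assumptions rather than facts from the thread: the log location `/tmp/pdf_extract_debug.log`, the function names, and the idea that eZ Publish reads the extracted text from the script's standard output. Check the log to see how the script is actually invoked on your installation before relying on it.

```python
#!/usr/bin/env python3
"""Hypothetical debugging wrapper to name in TextExtractionTool."""
import datetime
import shutil
import subprocess
import sys

# Assumed log location; use any path the web server user can write to.
LOG_PATH = "/tmp/pdf_extract_debug.log"


def log_invocation(args, log_path=LOG_PATH):
    """Append the current date and the received parameters to the log."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{stamp} args={args!r}\n")


def main(argv):
    args = argv[1:]
    log_invocation(args)
    if not args:
        return 1  # nothing to convert; the log shows the empty call
    if shutil.which("pdftotext") is None:
        log_invocation(["pdftotext not found on PATH"])
        return 1
    # First parameter is the PDF path; "-" makes pdftotext write to stdout.
    # Assumption: the caller reads the extracted text from stdout.
    result = subprocess.run(["pdftotext", args[0], "-"], capture_output=True)
    sys.stdout.buffer.write(result.stdout)
    return result.returncode


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

With the script saved somewhere executable, the TextExtractionTool value in binaryfile.ini would point at it instead of pstotext, and each indexing attempt should then leave one line in the log.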
Andy Caiger	Tuesday 18 May 2010 6:55:21 pm Thanks! This is a great idea and helped me solve the problem quickly, together with advice given at http://ez.no/ezpublish/documentation/configuration/optimization/speeding_up_acrobat_pdf_document_indexing_ :-)
