Searching content into a pdf file

Simone Conti

Monday 01 August 2011 7:34:00 am

Hi there, 

I have a lot of pdf files on my site and I need eZ Publish to search for specific content inside those pdf files, so that the matches are listed together with the standard search results.

Is that possible? Is there an extension or something else that makes it possible?

Thank you so much!

Peter Keung

Monday 01 August 2011 7:51:40 am

This is the first thing that comes to mind: http://projects.ez.no/eztika

http://www.mugo.ca
Mugo Web, eZ Partner in Vancouver, Canada

Simone Conti

Monday 01 August 2011 8:05:24 am

Unfortunately, that is not what I'm looking for.

I need something that allows me to search inside pdf files. Somebody told me that eZ Publish has this feature built in but that it needs to be enabled.

Any suggestions?

Steven E. Bailey

Monday 01 August 2011 9:07:18 am

eZPublish does have this feature and you should be seeing your pdfs indexed - with a bunch of caveats.

What happens is that when a pdf is saved (or you update your search index), the pdf is run through the tool defined by 

[PDFHandlerSettings]
TextExtractionTool=pstotext

in your binaryfile.ini

If you don't have this tool on your machine, then your pdfs won't be indexed. 
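A quick way to check whether the tool is installed is to look it up on the PATH. This is only a sketch: the function name `find_extraction_tool` is made up for illustration, and the web server or cron user may run with a different PATH than your shell, so run it as that user if you can.

```python
import shutil


def find_extraction_tool(tool_name):
    """Return the full path of tool_name if it is an executable on PATH, else None."""
    return shutil.which(tool_name)


if __name__ == "__main__":
    # "pstotext" is the value from the [PDFHandlerSettings] block above;
    # "pdftotext" is an alternative mentioned elsewhere in these forums.
    for tool in ("pstotext", "pdftotext"):
        path = find_extraction_tool(tool)
        print(f"{tool}: {path if path else 'not found on PATH'}")
```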

If you search for TextExtractionTool or pdftotext in these forums you'll see a couple other possible tools - such as:

http://share.ez.no/forums/extensions/ez-find/solr-indexing-error

If the tool is installed and your pdfs still aren't being indexed, it probably means that your pdfs aren't structurally text - the content is actually an image (or series of images) saved in the pdf container. In that case you're not going to be able to index them using pdftotext. A good test is to run whatever tool you have on the command line against the file that isn't being indexed to see what actually comes out. If nothing comes out, you'll have to use some other tool - like eztika (I've never used it) or something like tesseract - to extract the text.
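The command-line test described above can also be scripted if you have many files to check. This is a rough sketch, not what eZ Publish does internally: the function names and the `"{pdf}"` placeholder are invented for the example, and it assumes the tool writes the extracted text to standard output.

```python
import subprocess


def extract_text(command_template, pdf_path, timeout=60):
    """Run the extraction tool on pdf_path and return what it writes to stdout.

    command_template is a list of arguments in which the item "{pdf}"
    (a placeholder made up for this sketch) is replaced by pdf_path.
    """
    command = [pdf_path if part == "{pdf}" else part for part in command_template]
    result = subprocess.run(
        command,
        capture_output=True,
        text=True,
        errors="replace",  # do not crash on bytes that are not valid text
        timeout=timeout,
    )
    return result.stdout


def has_text_layer(command_template, pdf_path):
    """True if the tool produced any non-whitespace text for this file."""
    return bool(extract_text(command_template, pdf_path).strip())
```

For pdftotext the template would be `["pdftotext", "{pdf}", "-"]`, where the trailing `-` tells it to write to standard output; for pstotext, `["pstotext", "{pdf}"]` should be enough. A file for which `has_text_layer` returns False is a likely candidate for the image-only case.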

Certified eZPublish developer
http://ez.no/certification/verify/396111

Available for ezpublish troubleshooting, hosting and custom extension development: http://www.leidentech.com

Simone Conti

Thursday 04 August 2011 3:26:58 am

Now something works.

I decided to use eztika as suggested by Peter.

I have a question: where does eztika store its data? I hope it doesn't scan every pdf on each search... I have a very large number of pdf files!!

 

Thanks

Paul Borgermans

Friday 05 August 2011 10:23:26 am

Hi

eztika does not store the data itself; its goal is to extract the plain text for subsequent indexing by the configured search plugin (you should use eZ Find of course :) )

The default search plugin stores the indexing result in the database, while eZ Find uses Solr, which stores its data in Lucene index files on the filesystem.

This is done only when the pdf is uploaded or updated.
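To illustrate why a search never has to open the pdf files again, here is a toy in-memory index. It is not how the default search plugin or Solr is implemented, and `ToyIndex` and its methods are invented for this sketch; it only shows that extraction runs at index time while a search is a plain lookup.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index: extraction at index time, lookup at search time."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of documents containing it
        self.extractions = 0  # how many times text extraction has run

    def remove_document(self, doc_id):
        """Forget everything previously indexed for doc_id."""
        for ids in self.postings.values():
            ids.discard(doc_id)

    def index_document(self, doc_id, extract):
        """Called on upload or update: extract the text once and store its words."""
        self.remove_document(doc_id)
        text = extract()  # stands in for the extraction tool
        self.extractions += 1
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, word):
        """Called on every search: a lookup in the index, no file is opened."""
        return sorted(self.postings.get(word.lower(), set()))
```

However many times `search` is called, `extractions` only grows when a document is indexed again.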

hth

Paul

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans
