pdf search

Carlos Revillo

Thursday 04 May 2006 4:02:51 am

Hi. I'm trying to index PDF documents that I've uploaded through the admin interface, but I can't make it work.

I've installed pdftotext on my server and it's working.

Next, I created a file called ezpdftotext with this content:

#!/bin/sh
# ezpdftotext script: print the text of the PDF given as $1 to stdout
pdftotext "$1" -

I also created a file (binaryfile.ini.append.php) in my /settings/override folder with this content:

# Here you can add handlers for new datatypes.
[HandlerSettings]
MetaDataExtractor[text/plain]=plaintext
MetaDataExtractor[application/pdf]=pdf

# The path to the text extraction tool to use to 
# fetch the information in PDF files
[PDFHandlerSettings]
TextExtractionTool=ezpdftotext

I've also marked the file attribute as searchable in my class.

If I upload a plain text file, all is well: eZ reads the words in my file and updates ezsearch_word and the related tables as needed. But nothing happens when I upload a PDF file.

Any help would be much appreciated. Thanks.

Nicolas Frey

Thursday 02 April 2009 3:18:15 am

Hi,

I have the same problem. Does anyone have a solution?

I'm using eZ Publish 4.1.

My binaryfile.ini:

[HandlerSettings]
MetaDataExtractor[text/plain]=ezplaintext
MetaDataExtractor[application/pdf]=ezpdf
MetaDataExtractor[application/msword]=ezword

[PDFHandlerSettings]
TextExtractionTool=pdftotext.bat

My pdftotext.bat

pdftotext -enc UTF-8 %1

In my class, I checked "searchable" for the file attribute.

If I look in \var\site\storage\original\application after running updatesearchindex.php or uploading a file, a text file has been created with the correct content:

6fb8fdcc583cf155d3aeb82e289c0b31.pdf
6fb8fdcc583cf155d3aeb82e289c0b31.txt

In the search results, only text files are shown.

Any ideas?

Thanks.

Nicolas Frey (2ST)

Damien Pobel

Thursday 02 April 2009 3:56:17 am

Hi,

You should try enabling DebugOutput and DebugRedirection to see if something went wrong.
In addition, I think you should put the full path to your script in binaryfile.ini.append.php. You could also run your script on the original PDF file in a shell to check that it can extract the text; with some unusual PDF files, extraction sometimes fails.
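
A minimal sketch of that shell check, assuming a POSIX shell. The function name check_extractor is hypothetical and not part of eZ Publish. It runs the configured command followed by the file name and reports whether any text arrived on standard output:

```shell
#!/bin/sh
# Hypothetical helper (not part of eZ Publish): run an extraction command on
# one file and report whether it wrote any text to standard output.
# Usage: check_extractor "<command with options>" <file>
check_extractor() {
    tool=$1
    file=$2
    # $tool is deliberately unquoted so that options inside it are split
    # into separate words; the file name is appended as the last argument.
    # A failing tool simply yields empty output here.
    output=$($tool "$file" 2>/dev/null) || true
    if [ -n "$output" ]; then
        echo "OK: text on stdout"
    else
        echo "EMPTY: nothing on stdout"
        return 1
    fi
}

# When run as a script with two arguments, check that command and file.
if [ $# -ge 2 ]; then
    check_extractor "$1" "$2"
fi
```

For example, saved under the assumed name check.sh, it could be called as: sh check.sh /full/path/to/ezpdftotext /full/path/to/file.pdf. An EMPTY result would suggest that the tool writes its text somewhere other than standard output, or that it fails on that file.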

Damien
Planet eZ Publish.fr : http://www.planet-ezpublish.fr
Certification : http://auth.ez.no/certification/verify/372448
Publications about eZ Publish : http://pwet.fr/tags/keywords/weblog/ez_publish

Nicolas Frey

Thursday 02 April 2009 6:38:55 am

I found the problem.

class eZPDFParser
{
    function parseFile( $fileName )
    {
        $binaryINI = eZINI::instance( 'binaryfile.ini' );

        $textExtractionTool = $binaryINI->variable( 'PDFHandlerSettings', 'TextExtractionTool' );

        // save the buffer contents
        $buffer = ob_get_contents();
        ob_end_clean();

        // fetch the module printout
        ob_start();
        passthru( "$textExtractionTool $fileName" );
        $metaData = ob_get_contents();
        ob_end_clean();

        // fill the buffer with the old values
        ob_start();
        print( $buffer );

        return $metaData;
    }
}

This class runs the tool configured in binaryfile.ini and captures its output stream for search indexing.
The pdftotext help does not explain how to send the result directly to standard output. After some research, I found this command:

PDFtoText.exe filename.pdf -
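
On a Unix-like server, the same idea could be put in a small wrapper script that TextExtractionTool points to. This is only a sketch: the PDFTOTEXT variable and the extract_to_stdout function are illustrative names, not eZ Publish settings, and it assumes pdftotext is on the PATH unless PDFTOTEXT is set.

```shell
#!/bin/sh
# ezpdftotext: wrapper sketch that makes pdftotext write to standard output.
# PDFTOTEXT is a hypothetical override for the path to the pdftotext binary.
PDFTOTEXT=${PDFTOTEXT:-pdftotext}

extract_to_stdout() {
    # The trailing "-" takes the place of the output file name, so the
    # extracted text goes to stdout instead of a .txt file next to the PDF.
    "$PDFTOTEXT" -enc UTF-8 "$1" -
}

# When run as a script, extract the file given as the first argument.
if [ $# -ge 1 ]; then
    extract_to_stdout "$1"
fi
```

Keeping the options and the dash inside the wrapper means the file name is the only argument the caller has to append.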

Nicolas Frey (2ST)

Johann Lemaitre

Thursday 09 April 2009 7:57:09 am

Hi,

I followed all the points in this topic.
I changed my binaryfile.ini file so that it looks like this:

[PDFHandlerSettings]
TextExtractionTool=/var/www/ez/xpdf-3.02/xpdf/pdftotext -enc UTF-8

Just as a test, I added a dash "-" after the filename in the file "ezpdfparser.php":

passthru( "$textExtractionTool $fileName -" );

This dash changes the PDF conversion: no text file is generated any more (everything is sent to stdout).

However, my search results with eZ Find 2.0.0 are still empty.

Could you help me?
Thanks

Johann
