pdf search


Carlos Revillo

Thursday 04 May 2006 4:02:51 am

Hi. I'm trying to index PDF documents I've uploaded through the admin interface, but I can't make it work.

I've installed pdftotext on my server and it's working.

Next, I created a file called ezpdftotext with this content:

#!/bin/sh
# ezpdftotext: print the text of the PDF given as $1 on stdout
pdftotext "$1" -

I also created a file (binaryfile.ini.append.php) in my /settings/override folder with this content:

# Here you can add handlers for new datatypes.
[HandlerSettings]
MetaDataExtractor[text/plain]=plaintext
MetaDataExtractor[application/pdf]=pdf

# The path to the text extraction tool to use to 
# fetch the information in PDF files
[PDFHandlerSettings]
TextExtractionTool=ezpdftotext

I've also made the file attribute searchable in my class.

If I upload a plain .txt file, all is well: eZ reads the words in my file and does what is needed in ezsearch_word and related tables. But nothing happens when I upload a PDF file.

Any help would be much appreciated. Thanks.

Nicolas Frey

Thursday 02 April 2009 3:18:15 am

Hi,

I have the same problem. Does anyone have a solution?

I'm using eZ Publish 4.1.

My binaryfile.ini:

[HandlerSettings]
MetaDataExtractor[text/plain]=ezplaintext
MetaDataExtractor[application/pdf]=ezpdf
MetaDataExtractor[application/msword]=ezword

[PDFHandlerSettings]
TextExtractionTool=pdftotext.bat

My pdftotext.bat:

pdftotext -enc UTF-8 %1

In my class, I checked "searchable" for the file attribute.

If I look in \var\site\storage\original\application after running updatesearchindex.php or uploading a file, a text file has been created with the correct content:

6fb8fdcc583cf155d3aeb82e289c0b31.pdf
6fb8fdcc583cf155d3aeb82e289c0b31.txt

In the search results, only text files are shown.

Any ideas?

Thanks.

Nicolas Frey (2ST)

Damien Pobel

Thursday 02 April 2009 3:56:17 am

Hi,

You should try enabling DebugOutput and DebugRedirection to see if something went wrong.
In addition, I think you should put the full path to your script in binaryfile.ini.append.php. You could also run your script on the original PDF file in a shell to check that it can extract the text; with some weird PDF files, extraction sometimes fails.
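
A minimal sketch of that shell check, assuming the extraction tool is a single command or wrapper script that takes the PDF path as its only argument. The function name check_extractor and the paths in the example call are hypothetical and not part of eZ Publish:

```shell
#!/bin/sh
# check_extractor TOOL FILE
# Hypothetical helper, not part of eZ Publish. TOOL is a single command name
# or path (for example a wrapper script), FILE is the PDF to test.
check_extractor() {
    tool=$1
    file=$2

    # Capture stdout only; error messages on stderr still reach the terminal.
    text=$("$tool" "$file") || {
        echo "FAIL: $tool exited with status $?"
        return 1
    }

    if [ -z "$text" ]; then
        echo "FAIL: $tool printed nothing on stdout"
        return 1
    fi

    echo "OK: $tool printed text on stdout"
}

# Example call, using placeholder paths only:
# check_extractor /full/path/to/ezpdftotext /full/path/to/file.pdf
```

If the check reports that nothing was printed, the tool is probably writing its result somewhere other than standard output.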

Damien
Planet eZ Publish.fr : http://www.planet-ezpublish.fr
Certification : http://auth.ez.no/certification/verify/372448
Publications about eZ Publish : http://pwet.fr/tags/keywords/weblog/ez_publish

Nicolas Frey

Thursday 02 April 2009 6:38:55 am

I found the problem.

class eZPDFParser
{
    function parseFile( $fileName )
    {
        $binaryINI = eZINI::instance( 'binaryfile.ini' );

        $textExtractionTool = $binaryINI->variable( 'PDFHandlerSettings', 'TextExtractionTool' );

        // save the buffer contents
        $buffer = ob_get_contents();
        ob_end_clean();

        // fetch the module printout
        ob_start();
        passthru( "$textExtractionTool $fileName" );
        $metaData = ob_get_contents();
        ob_end_clean();

        // fill the buffer with the old values
        ob_start();
        print( $buffer );

        return $metaData;
    }
}

This class runs the tool configured in "binaryfile.ini" and captures its standard output for the search index. My batch file wrote the text to a .txt file instead, so nothing was captured.
The pdftotext help does not explain how to send the result directly to standard output. After some research, I found this command:

PDFtoText.exe filename.pdf -
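
The difference matters because parseFile() keeps only what the tool prints. The sketch below illustrates this with two stand-in functions (both hypothetical, neither calls the real pdftotext): one writes a .txt file next to the input, the other prints to standard output, and a capturing caller only sees text from the second.

```shell
#!/bin/sh
# Stand-ins for the two ways an extraction tool can behave.
# Both are hypothetical; neither calls the real pdftotext.

# Writes the text next to the input as a .txt file and prints nothing.
extract_to_file() {
    printf 'text of %s\n' "$1" > "${1%.pdf}.txt"
}

# Prints the text on standard output, which is what the caller captures.
extract_to_stdout() {
    printf 'text of %s\n' "$1"
}

workdir=$(mktemp -d)
pdf="$workdir/sample.pdf"

# Capture stdout of each variant, the way a calling program would.
captured_from_file=$(extract_to_file "$pdf")
captured_from_stdout=$(extract_to_stdout "$pdf")

# Record whether the file variant left a .txt file behind, then clean up.
if [ -f "$workdir/sample.txt" ]; then file_written=yes; else file_written=no; fi
rm -rf "$workdir"

echo "file variant:   captured '$captured_from_file', .txt written: $file_written"
echo "stdout variant: captured '$captured_from_stdout'"
```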

Nicolas Frey (2ST)

Johann Lemaitre

Thursday 09 April 2009 7:57:09 am

Hi,

I followed all the points in this topic.
I changed my binaryfile.ini file and it now looks like this:

[PDFHandlerSettings]
TextExtractionTool=/var/www/ez/xpdf-3.02/xpdf/pdftotext -enc UTF-8

Just as a test, I added a dash "-" after the filename in the file "ezpdfparser.php":

passthru( "$textExtractionTool $fileName -" );

The dash changes the PDF conversion: no text file is generated any more, because everything is sent to stdout.
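
One possible alternative to editing ezpdfparser.php is a wrapper script that appends the dash itself. This is only a sketch: the PDFTOTEXT variable and the function name extract_pdf_text are made up for illustration, and the default path is simply the one from the setting above.

```shell
#!/bin/sh
# Hypothetical wrapper that could be named in TextExtractionTool.
# PDFTOTEXT defaults to the path quoted in this post; override it if yours differs.
PDFTOTEXT=${PDFTOTEXT:-/var/www/ez/xpdf-3.02/xpdf/pdftotext}

extract_pdf_text() {
    # The trailing "-" is the output name that sends the text to stdout.
    "$PDFTOTEXT" -enc UTF-8 "$1" -
}

# When run as a script with a file argument, extract that file.
if [ "$#" -gt 0 ]; then
    extract_pdf_text "$1"
fi
```

With a script like this saved and made executable, TextExtractionTool would point at its full path and the passthru call could stay as shipped.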

Despite the dash, my search results with eZ Find 2.0.0 are still always empty.

Could you help me?
Thanks

Johann
