PDF files not indexed

Jeroen Sangers

Thursday 29 June 2006 1:54:44 am

I am trying to include the contents of PDF files in the search index, but cannot get it to work.

I installed pstotext on my server and tested it with a PDF file. I followed the steps laid out in http://ez.no/products/ez_publish/documentation/configuration/configuration/search_engine/configuring_binary_file_indexing and uploaded a PDF file to my site. However, when I search for words from that file, no results show up.

Is there any way I can turn on logging/auditing to see what is happening when I upload a PDF file?

Siniša Šehović

Thursday 29 June 2006 11:20:51 pm

Hi Jeroen

I have the same problem on eZ 3.8.2.

Can anyone help us here? :-)

Best regards,
S.

---
If at first you don't succeed, look in the trash for the instructions.

Jeroen Sangers

Friday 30 June 2006 8:17:04 am

I still can't get it to work. I have tried moving pstotext to various locations on my server, I switched to pdftotext, and I specified the full path to pstotext in my binaryfile.ini.append.php, but I always receive the same error:

Plugin for application/pdf was not found

Does anybody have a clue on how I can solve this?

Siniša Šehović

Saturday 01 July 2006 4:18:09 am

Hi Jeroen

What happens if you try to execute pstotext from the Linux shell?

Do you get any errors?
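
A quick check along these lines might look like the sketch below. check_tool is only a helper name made up for this example, and the web server user may have a different PATH than your login shell, so a tool found here is not guaranteed to be found by eZ Publish.

```shell
#!/bin/sh
# Report whether a command-line tool can be found on the current PATH.
# check_tool is a hypothetical helper name, not part of eZ Publish.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
    fi
}

# Check both extraction tools mentioned in this thread.
for tool in pstotext pdftotext; do
    echo "$tool: $(check_tool "$tool")"
done
```

If a tool is reported as found, run it by hand on a sample PDF and watch for error messages.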

Did you try this approach?
http://ez.no/community/forum/setup_design/indexing_binary_files_excel_and_powerpoint

S.

Jeroen Sangers

Monday 03 July 2006 1:33:13 am

I managed to solve it this weekend. There were two problems, and in every configuration I tried, at least one of them was still present, until I hit on the right combination!

The first problem is a mistake in the documentation. The page at http://ez.no/products/ez_publish/documentation/configuration/configuration/search_engine/configuring_binary_file_indexing gives the following setting:

[HandlerSettings]
MetaDataExtractor[application/pdf]=pdf

I copied that setting to my binaryfile.ini file, which effectively broke PDF parsing. Of course, I should have left it at the default value:

[HandlerSettings]
MetaDataExtractor[application/pdf]=ezpdf

The second problem was related to pdftotext. I found out that the command used by eZ Publish (pdftotext example.pdf) does not produce any output. To get this to work, I had to modify the call in kernel/classes/datatypes/ezbinaryfile/plugins/ezpdfparser.php so that it reads:

passthru( "$textExtractionTool $fileName -" );
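
For anyone wondering why the trailing "-" matters: called with only an input file, pdftotext writes the text to a .txt file next to the PDF and prints nothing, while "-" as the output file name sends the text to standard output instead. A minimal shell sketch to try the same call by hand (extract_text is a helper name I made up for this example, and example.pdf is a placeholder):

```shell
#!/bin/sh
# extract_text TOOL FILE
# Runs TOOL on FILE with "-" as the output argument, so that a tool such as
# pdftotext prints the extracted text to stdout instead of writing FILE.txt.
# extract_text is a hypothetical helper, not part of eZ Publish.
extract_text() {
    "$1" "$2" -
}

# Hypothetical usage; example.pdf is a placeholder file name:
# extract_text pdftotext example.pdf | head
```

If the text shows up in the terminal with this call but not with the plain "pdftotext example.pdf", you are most likely hitting the same problem.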

Siniša Šehović

Tuesday 04 July 2006 2:57:18 am

Hi Jeroen

Thanks for the tip!

Now I can index my PDFs.

Best regards,
S.