eZ Find doesn't index PDFs with special characters like "ç"

eric figo

Wednesday 19 November 2008 6:12:14 am

Hi,

I'm using eZ Find and pstotext to index PDF files.
Some are indexed and some are not.

I tried many files. For example, a PDF containing only this text is indexed: "Website un test d’indexation pour voir si ca marche ….. hsdhhdhhd"

But if I change the "c" to a "ç", as in "Website un test d’indexation pour voir si ça marche ….. hsdhhdhhd", the PDF is not indexed.

Any ideas? My database is in UTF-8, and I haven't changed the charset configuration in eZ Publish.

Thanks for your responses

Paul Borgermans

Wednesday 26 November 2008 10:59:20 am

pstotext is not the best solution for converting PDFs to raw text. I guess that it fails to convert the PDF file in question (try it on the command line to see what happens).
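
A minimal sketch of that command-line check, assuming a POSIX shell. The function name check_extraction and the file name test.pdf are hypothetical; the converter command is passed in as arguments, so the same check works for pstotext or any other tool that writes the extracted text to standard output.

```shell
#!/bin/sh
# check_extraction: run a text-extraction command and report what it produced.
# Usage: check_extraction <command> [args...]
# (hypothetical helper, not part of eZ Publish or eZ Find)
check_extraction() {
    # Capture the converter's standard output; discard its error messages.
    out=$("$@" 2>/dev/null)
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "FAILED (exit status $status)"
        return 1
    fi
    if [ -z "$out" ]; then
        echo "EMPTY (converter ran but produced no text)"
        return 1
    fi
    # tr strips the padding some wc implementations add around the count.
    echo "OK ($(printf '%s' "$out" | wc -c | tr -d ' ') bytes)"
}

# Example (hypothetical file name):
# check_extraction pstotext test.pdf
```

A failure or an empty result here would point at the converter itself rather than at the indexing step.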

It is better to use pdftotext from the xpdf project. Then create a new script, for example called ezpdftotext, with the following content (change the path to pdftotext to match your installation):

#!/bin/sh
<path to >/pdftotext -enc "UTF-8" "$1" -

Then configure this script in binaryfile.ini.
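
Because the script asks for UTF-8 output, it can help to confirm that what it emits really is well-formed UTF-8 before it reaches the indexer. The following is only a diagnostic sketch, assuming iconv is installed; is_valid_utf8 is a hypothetical helper name and test.pdf a hypothetical file.

```shell
#!/bin/sh
# is_valid_utf8: succeed only if standard input is well-formed UTF-8.
# iconv exits non-zero when it meets a byte sequence that is illegal
# in the declared source encoding.
# (hypothetical helper, not part of eZ Publish or eZ Find)
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Example (hypothetical file name):
# ./ezpdftotext test.pdf | is_valid_utf8 && echo "valid UTF-8" || echo "not valid UTF-8"
```

If the check fails for a PDF containing "ç", the text is probably being emitted in another encoding such as Latin1.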

Note that the default installation will "normalize" Latin1 characters, so eZ Find/Solr will transform "reçu" to "recu" (and similar), and searching for either form will produce the hit.
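
To illustrate why both spellings match: when the same folding is applied to the indexed text and to the query, "reçu" and "recu" end up as the same term. The sketch below imitates that idea with sed for a handful of characters. It is not the filter Solr actually uses, and fold_accents is a hypothetical name.

```shell
#!/bin/sh
# fold_accents: replace a few accented Latin letters with their base letter.
# Illustration only; it covers just the characters listed here.
fold_accents() {
    sed -e 's/ç/c/g' -e 's/é/e/g' -e 's/è/e/g' -e 's/à/a/g'
}

# Fold both the "indexed" word and the "query" word, then compare them.
indexed=$(printf '%s\n' 'reçu' | fold_accents)
query=$(printf '%s\n' 'recu' | fold_accents)
[ "$indexed" = "$query" ] && echo "match"
```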

Best regards

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

eric figo

Monday 01 December 2008 2:27:53 am

Hi,

Thanks for the response.

To clarify: when I run pstotext on the command line with my PDF, I get the plain text without trouble.

The problem is that when I use the script for indexing, the files with special characters are not indexed.
I can't find them, even when I search for another word in the PDF that has no special characters.

I tried your solution with pdftotext, but I have the same problem.

Best regards

Paul Borgermans

Friday 05 December 2008 1:17:21 pm

Which versions are you using (eZ Find, eZ Publish)?

eric figo

Tuesday 09 December 2008 1:09:54 am

I'm using eZ Publish 4.0.1 with eZ Find 1.0.0beta2.
