eZ Find doesn't index PDFs with special chars like "ç"

eric figo

Wednesday 19 November 2008 6:12:14 am

Hi,

I'm using eZ Find and pstotext to index PDF files.
Some are indexed and some are not.

I tried many files. For example, a PDF containing only this text is indexed: "Website un test d’indexation pour voir si ca marche ….. hsdhhdhhd"

But if I change the "c" to "ç", as in "Website un test d’indexation pour voir si ça marche ….. hsdhhdhhd", the PDF is not indexed.

Any ideas? My database is in UTF-8, and I haven't changed the charset configuration in eZ Publish.

Thanks for your responses.

Paul Borgermans

Wednesday 26 November 2008 10:59:20 am

pstotext is not the best solution for converting PDFs to raw text. I guess it fails to convert the PDF file in question (try it on the command line to see what happens).

A better option is pdftotext from the xpdf project. Create a new script, for example called ezpdftotext, with the following content (change the path to pdftotext to match your installation):
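
One thing a terminal can hide is the encoding of the extracted text: it may look fine on screen and still not be valid UTF-8. Here is a minimal sketch for checking that, assuming iconv is installed and the extractor writes its text to stdout. The names is_valid_utf8 and check_extractor are hypothetical helpers, not part of eZ Publish or eZ Find.

```shell
#!/bin/sh
# Hypothetical helpers for testing a text extractor by hand.

# Succeeds only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Runs an extractor that writes text to stdout (pstotext, or a wrapper
# script around pdftotext) on one file and prints a one-word verdict:
# failed, empty, utf8 or not-utf8.
check_extractor() {
    tool=$1
    input=$2
    out=$(mktemp) || return 2
    if ! "$tool" "$input" >"$out" 2>/dev/null; then
        verdict=failed        # the tool exited with an error
    elif [ ! -s "$out" ]; then
        verdict=empty         # the tool ran but produced no text
    elif is_valid_utf8 "$out"; then
        verdict=utf8
    else
        verdict=not-utf8      # e.g. Latin-1 bytes for "ç"
    fi
    rm -f "$out"
    echo "$verdict"
}

# Example (the path is a placeholder):
#   check_extractor pstotext /path/to/test.pdf
```

A verdict of not-utf8 on the file containing "ç" would point at the extraction step rather than at eZ Find itself, but that is only one possible cause.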

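#!/bin/sh
# Convert the PDF given as first argument to UTF-8 text on stdout.
# "$1" is quoted so that file names containing spaces still work.
<path to >/pdftotext -enc "UTF-8" "$1" -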

Then configure this script in binaryfile.ini.

Note that the default installation will "normalize" Latin1 characters, so eZ Find/Solr will transform "reçu" to "recu", and so on. Searching for either form will therefore produce the hit.
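
As a rough sketch of that configuration step: in the binaryfile.ini shipped with eZ Publish 4.x, as far as I recall, the PDF extractor is set through TextExtractionTool in the [PDFHandlerSettings] block, so please verify the block and setting names against your own settings/binaryfile.ini. The script path below is a placeholder, and the change would normally go in a settings override rather than in the shipped file.

```ini
# Assumed layout, to be checked against your installation.
[PDFHandlerSettings]
# Placeholder path to the wrapper script described above.
TextExtractionTool=/path/to/ezpdftotext
```

The script must be executable by the user that runs the indexing, and existing PDFs need to be reindexed before the change has any effect on search results.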
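
To illustrate why both spellings match, here is a toy sketch of accent folding in shell. It is not the Solr implementation, only a stand-in: fold_accents is a hypothetical function covering a handful of characters, and the point is that the same folding is applied to the indexed text and to the query.

```shell
#!/bin/sh
# Toy accent folding for a few UTF-8 characters (illustration only).
# Separate s commands are used instead of y/…/…/ so that it also
# works when sed processes the input byte by byte.
fold_accents() {
    sed -e 's/ç/c/g' -e 's/é/e/g' -e 's/è/e/g' -e 's/à/a/g'
}

# Both the indexed word and the query end up as the same token.
indexed=$(printf '%s\n' 'reçu' | fold_accents)
query=$(printf '%s\n' 'recu' | fold_accents)
[ "$indexed" = "$query" ] && echo "match"
```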

Best regards

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

eric figo

Monday 01 December 2008 2:27:53 am

Hi,

Thanks for the response.

To clarify: when I run pstotext on the command line with my PDF, I get the plain text without trouble.

The problem is that when I use the script to index, the files with special chars are not indexed.
I can't find them, even when I search for another word from the PDF that has no special chars.

I tried your solution with pdftotext, but I have the same problem.

Best regards

Paul Borgermans

Friday 05 December 2008 1:17:21 pm

Which versions are you using (eZ Find, eZ Publish)?

eric figo

Tuesday 09 December 2008 1:09:54 am

I'm using eZ Publish 4.0.1 with eZ Find 1.0.0beta2.
