eZFind - indexing errors

Fabien Mas

Friday 11 September 2009 2:32:02 am

Hi,
I get a lot of errors when I index my site:

 **** Warning: Fonts with Subtype = /TrueType should be embedded.
               But Arial-ItalicMT is not embedded.


As a result, my object is not indexed at all (not only the file datatype, but the whole object).

How can I solve this?

Thx

Paul Borgermans

Friday 11 September 2009 2:58:06 am

What do you use for conversion of binary files?

It seems you use pstotext, which is the default setting (but far from the best).

Paul

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

Fabien Mas

Friday 11 September 2009 5:01:22 am

Hi Paul,
Indeed, I am using pstotext.
Which one would you advise me to use?

thx for your help :)

Fabien

Vincent Tabary

Friday 11 September 2009 5:31:52 am

Hi all,

That could be interesting for me too :)

I installed pstotext because eZ Find asked for it, but I do not know of any other software for this.

Vinz
http://vincent.tabary.me

Fabien Mas

Friday 11 September 2009 5:40:38 am

I have activated the eztika extension, but I also have some trouble with it:

Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:58)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:111)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:57)

Paul Borgermans

Friday 11 September 2009 12:32:31 pm

Hello, eztika is not very robust with Asian character sets, but it should be fine with others.

For PDF in general, the best option is to use the xpdf tools.

You need to create a wrapper script for xpdf's pdftotext utility.

This is what I use (locally called ezpdftotext):

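#!/bin/sh
# Log PDF metadata for troubleshooting; "$1" is quoted so paths with spaces work.
/opt/local/bin/pdfinfo "$1" >> /tmp/ezpdftotext.log
# Write the extracted text as UTF-8 to standard output ("-").
/opt/local/bin/pdftotext -enc "UTF-8" "$1" -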

The pdfinfo line is only used for logging and can be removed once the configuration works.
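
A slightly more defensive variant of such a wrapper could look like the sketch below. It is only an illustration: the function name and the PDFTOTEXT_LOG variable are made up for this example and are not eZ Find settings, and it assumes pdfinfo and pdftotext are on the PATH. It checks that exactly one readable file was passed and treats logging as best-effort, so a pdfinfo failure does not block the conversion.

```shell
#!/bin/sh
# Hypothetical wrapper sketch. PDFTOTEXT_LOG is an illustrative variable name,
# not an eZ Find setting; it defaults to the log path used in the post above.
PDFTOTEXT_LOG="${PDFTOTEXT_LOG:-/tmp/ezpdftotext.log}"

# Convert one PDF to UTF-8 text on standard output.
ezpdftotext() {
    # Refuse to run without exactly one readable file argument.
    if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
        echo "usage: ezpdftotext FILE.pdf" >&2
        return 2
    fi
    # Logging is best-effort: a pdfinfo failure must not stop the conversion.
    pdfinfo "$1" >> "$PDFTOTEXT_LOG" 2>&1 || true
    # The trailing "-" tells pdftotext to write to standard output.
    pdftotext -enc UTF-8 "$1" -
}

# When saved as a script, convert the file passed by the caller.
if [ "$#" -gt 0 ]; then
    ezpdftotext "$@"
fi
```

Returning a non-zero status on bad input should make a misconfigured call easier to spot than an empty result.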

So, all things considered: use eztika for everything except PDF, for which you should use xpdf.
Expect eztika to improve in the future; it is also making its way into Solr (and when that is stable enough, eZ Find will use it instead of the binary file wrappers).

Cheers
Paul


Fabien Mas

Monday 14 September 2009 7:59:32 am

Hi Paul,

I have created my own parser using xpdf.
I get no errors now.
I log the generated text and it looks fine.

But I have a new problem ;)
With the default search engine it works well, but with eZ Find activated, no word from my file is indexed (even though xpdf works well).

When I search for a word, I get no results.
Is there something specific to do for eZ Find?

Thx,
Fabien

Fabien Mas

Thursday 17 September 2009 1:41:30 am

I got it :)
The page breaks were causing trouble in the XML generated by Solr.

So now I use this command and it works fine:

pdftotext -enc "UTF-8" -eol unix -nopgbrk "$1" -
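
A likely explanation is that pdftotext separates pages with form feed characters, which are not allowed in XML 1.0, and -nopgbrk stops it from emitting them. As an extra safety net, the output could also be piped through a small filter that deletes every control character XML 1.0 forbids. This is only a sketch; the function name is made up for the example and is not part of eZ Find or xpdf.

```shell
#!/bin/sh
# Delete the C0 control characters that XML 1.0 forbids, keeping tab (\011),
# line feed (\012) and carriage return (\015). Form feed is \014.
# LC_ALL=C makes tr work on bytes, so UTF-8 sequences pass through untouched.
strip_xml_control_chars() {
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Hypothetical usage inside the wrapper script:
#   pdftotext -enc UTF-8 -eol unix -nopgbrk "$1" - | strip_xml_control_chars
```

The filter reads standard input and writes standard output, so it can be appended to any extractor command, not only pdftotext.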
