eZFind - indexing errors


Fabien Mas

Friday 11 September 2009 2:32:02 am

Hi,
I get a lot of errors like this one when I index my site:

 **** Warning: Fonts with Subtype = /TrueType should be embedded.
               But Arial-ItalicMT is not embedded.


As a result, my object is not indexed at all (not just the file attribute; the whole object is left out of the index).

How can I solve this?

Thx

Paul Borgermans

Friday 11 September 2009 2:58:06 am

What do you use for conversion of binary files?

It seems you are using pstotext, which is the default setting (but far from the best).

Paul

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

Fabien Mas

Friday 11 September 2009 5:01:22 am

Hi Paul,
Indeed, I am using pstotext.
Which one would you advise me to use?

thx for your help :)

Fabien

Vincent Tabary

Friday 11 September 2009 5:31:52 am

Hi all,

That could be interesting for me too :)

I installed pstotext because eZ Find asked for it, but I do not know of any other software for that.

Vinz
http://vincent.tabary.me

Fabien Mas

Friday 11 September 2009 5:40:38 am

I have activated the eztika extension, but I am running into trouble with it as well:

Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:58)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:111)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:57)

Paul Borgermans

Friday 11 September 2009 12:32:31 pm

Hello,

eztika is not very robust with Asian character sets, but it should be fine with others.

For PDF in general, the best option is the xpdf tools.

You need to create a wrapper script for xpdf's pdftotext utility.

This is what I use (locally called ezpdftotext):

#!/bin/sh
# Log the PDF metadata, then write the extracted text to stdout ("-").
/opt/local/bin/pdfinfo "$1" >> /tmp/ezpdftotext.log
/opt/local/bin/pdftotext -enc "UTF-8" "$1" -

The pdfinfo line is only there for logging and can be removed once the configuration works.
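
A slightly more defensive variant of such a wrapper might look like the sketch below. It checks that the input file is readable, quotes the file name so that paths with spaces survive, and does not let a failing pdfinfo call block the extraction. The PDFTOTEXT, PDFINFO and LOGFILE variables are hypothetical overrides introduced here for illustration (they are not eZ Find settings); their defaults are the paths from the script above.

```shell
#!/bin/sh
# Sketch of a more defensive ezpdftotext wrapper.
# PDFTOTEXT, PDFINFO and LOGFILE are hypothetical overrides, not eZ Find settings.
PDFTOTEXT="${PDFTOTEXT:-/opt/local/bin/pdftotext}"
PDFINFO="${PDFINFO:-/opt/local/bin/pdfinfo}"
LOGFILE="${LOGFILE:-/tmp/ezpdftotext.log}"

ezpdftotext() {
    # Refuse to run without a readable input file.
    if [ "$#" -lt 1 ] || [ ! -r "$1" ]; then
        echo "usage: ezpdftotext FILE.pdf" >&2
        return 2
    fi
    # Logging only: a pdfinfo failure must not stop the text extraction.
    "$PDFINFO" "$1" >> "$LOGFILE" 2>&1 || true
    # "-" as the output name makes pdftotext write the text to stdout.
    "$PDFTOTEXT" -enc "UTF-8" "$1" -
}

# Run only when a file argument is given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    ezpdftotext "$@"
fi
```

Running the script by hand on one of the problem PDFs, before pointing eZ Find at it, shows quickly whether the extraction itself works.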

So, all things considered: use eztika for everything except PDF, for which you should use xpdf.
Expect eztika to improve in the future. Tika is also being integrated into Solr (and once that is stable enough, eZ Find will use it instead of the binary file wrappers).

Cheers
Paul


Fabien Mas

Monday 14 September 2009 7:59:32 am

Hi Paul,

I have created my own parser using xpdf, and I no longer get any errors.
I log the generated text and it looks fine.

But I have a new problem ;)
With the default search engine it works well, but with eZ Find activated, not a single word from my file is indexed (even though xpdf works well).

When I search for a word, I get no results.
Is there anything specific to do for eZ Find?

Thx,
Fabien

Fabien Mas

Thursday 17 September 2009 1:41:30 am

I got it :)
It was the page breaks that were causing trouble in the XML generated by Solr.

So now I use this command and it works fine:

pdftotext -enc "UTF-8" -eol unix -nopgbrk "$1" -
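
For anyone hitting the same symptom: without -nopgbrk, pdftotext separates pages with form feed characters, and a form feed is not a legal character in XML 1.0. As an extra safety net, a wrapper could also filter its output. The sketch below assumes a POSIX shell; the function name strip_xml_controls is made up for illustration, and this is not something eZ Find does by itself.

```shell
#!/bin/sh
# strip_xml_controls is a hypothetical helper name.
# It deletes the control characters that XML 1.0 does not allow: everything
# below 0x20 except tab, line feed and carriage return. The form feed
# (octal 014) is one of the deleted characters.
strip_xml_controls() {
    tr -d '\000-\010\013\014\016-\037'
}

# Possible use inside the wrapper script:
# pdftotext -enc "UTF-8" -eol unix -nopgbrk "$1" - | strip_xml_controls
```

Note that the filter deletes the characters outright, so text on either side of a removed form feed is joined; with -nopgbrk in place that case should not arise.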
