eZFind - indexing errors

Fabien Mas

Friday 11 September 2009 2:32:02 am

Hi,
I get a lot of errors when I index my site:

 **** Warning: Fonts with Subtype = /TrueType should be embedded.
               But Arial-ItalicMT is not embedded.


After that, my object is not indexed at all (the whole object, not only the file datatype).

How can I solve this?

Thx

Paul Borgermans

Friday 11 September 2009 2:58:06 am

What do you use for conversion of binary files?

It seems you are using pstotext, which is the default setting (but far from the best).

Paul

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

Fabien Mas

Friday 11 September 2009 5:01:22 am

Hi Paul,
Indeed, I am using pstotext.
Which tool would you advise me to use instead?

Thanks for your help :)

Fabien

Vincent Tabary

Friday 11 September 2009 5:31:52 am

Hi all,

That could be interesting for me too :)

I installed pstotext because eZ Find asked for it, but I do not know of any other software for that.

Vinz
http://vincent.tabary.me

Fabien Mas

Friday 11 September 2009 5:40:38 am

I have activated the eztika extension, but I am running into trouble there as well:

Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:58)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:111)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:57)

Paul Borgermans

Friday 11 September 2009 12:32:31 pm

Hello,

eztika is not very robust with Asian character sets, but should be fine with others.

For PDF in general, the best option is the xpdf tools.

You need to create a wrapper script for xpdf's pdftotext utility.

This is what I use (locally called ezpdftotext):

#!/bin/sh
# Log the PDF metadata (debugging only).
/opt/local/bin/pdfinfo "$1" >> /tmp/ezpdftotext.log
# Write the extracted text to stdout as UTF-8.
/opt/local/bin/pdftotext -enc "UTF-8" "$1" -

The pdfinfo line is only used for logging and can be removed once the configuration works.
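
Rather than deleting the logging line, it could be made switchable. This is only a sketch: EZPDF_LOG is a hypothetical environment variable (not an eZ Find setting), and pdfinfo and pdftotext are assumed to be on the PATH instead of under /opt/local/bin.

```shell
#!/bin/sh
# Sketch of a wrapper with optional logging.
# EZPDF_LOG is a hypothetical variable: set it to a file path to enable logging.

log_pdf_info() {
    # Do nothing unless a log file path was provided.
    [ -n "${EZPDF_LOG:-}" ] || return 0
    # Append pdfinfo output (or a short note when pdfinfo fails) to the log.
    pdfinfo "$1" >> "$EZPDF_LOG" 2>&1 || echo "pdfinfo failed for $1" >> "$EZPDF_LOG"
}

# Only convert when a file argument is given.
if [ "$#" -ge 1 ]; then
    log_pdf_info "$1"
    pdftotext -enc "UTF-8" "$1" -
fi
```

With EZPDF_LOG unset the script only prints the extracted text; running it as EZPDF_LOG=/tmp/ezpdftotext.log ezpdftotext file.pdf would also record the pdfinfo output.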

So, all things considered: use eztika for everything except PDF, for which you should use xpdf.
Expect eztika to improve in the future; it is also being integrated into Solr (and when that is stable enough, eZ Find will use it instead of the binary file wrappers).

Cheers
Paul


Fabien Mas

Monday 14 September 2009 7:59:32 am

Hi Paul,

I have created my own parser using xpdf.
I get no errors now.
I log the generated text and it looks fine.

But I have a new problem ;)
With the default search engine it works well, but with eZ Find activated, not a single word from my file is indexed (even though xpdf works well).

When I search for a word, I get no results.
Is there anything specific to do for eZ Find?

Thx,
Fabien

Fabien Mas

Thursday 17 September 2009 1:41:30 am

I got it :)
The problem was the page breaks (the form feed characters pdftotext inserts between pages), which broke the XML generated by Solr.

So now I use this command and it works fine:

pdftotext -enc "UTF-8" -eol unix -nopgbrk "$1" -
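
For anyone hitting the same issue: form feed is one of the control characters that XML 1.0 does not allow, which would explain why the indexing request failed. As a belt-and-braces sketch, the wrapper could also filter out any remaining disallowed control characters. The function names below are hypothetical, and pdftotext is assumed to be on the PATH.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of eZ Find or xpdf.

# Delete control characters that XML 1.0 forbids: everything below 0x20
# except tab (0x09), newline (0x0A) and carriage return (0x0D).
strip_xml_control_chars() {
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Convert one PDF to UTF-8 text without page breaks, then filter the result.
pdf_to_clean_text() {
    pdftotext -enc "UTF-8" -eol unix -nopgbrk "$1" - | strip_xml_control_chars
}

# Only convert when a file argument is given.
if [ "$#" -ge 1 ]; then
    pdf_to_clean_text "$1"
fi
```

The filter only touches bytes below 0x20, so multi-byte UTF-8 characters pass through unchanged.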