Indexing content of files using Solr (ezfind)

Laurence Bonhomme

Wednesday 01 October 2008 1:36:23 am

Hi there,

I'm trying to index the content of files (txt, pdf) using eZ Find + Solr and am having trouble with it.

eZPublish 4.0.1
eZFind 1.0.0 beta2
Linux Debian 4

First of all, I installed the eZ Publish and eZ Find packages as recommended.
When I create a new media/file object and upload a file (txt or pdf), indexing works perfectly and I can search for its content (I can also find my words in the ezkeyword database table).

Because I found the default search a bit "light", I decided to test with Solr.
... And now everything goes wrong.

I'm pretty sure it is installed correctly, because I can search for article content or file summaries in the admin Solr search.

But nothing is found for the content of the uploaded file itself.

What am I doing wrong?
Is there a trick?

Looking at the Solr guide (http://wiki.apache.org/solr/), I found this:
"Solr has an extensible DocumentHandler architecture that allows you to feed it XML and CSV documents. There is now a patch file available as part of SOLR-284 that adds support for parsing rich binary formats. "

Do we have to patch the provided Solr?

Would anyone be so kind as to help?

Thanks a lot
Laurence

Christian Rößler

Wednesday 22 October 2008 11:01:54 am

Laurence,

Perhaps a bit late, but better late than never :-)

I had pretty much the same problem. Perhaps this link will help you: http://ez.no/developer/articles/indexing_multiple_binary_file_types

I was able to set up a generic binary file handler that eZ Find calls for every physical file. This handler called several external programs (pdftotext, doc2txt); the parsed content of each file was printed to stdout and captured by the handler, then returned to eZ Find, which saved it into the Solr index via an HTTP request.
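
To make the idea concrete, here is a minimal sketch of that pattern in Python (not the actual eZ Publish handler, which is PHP). It picks an external converter by file extension, runs it, and captures whatever the tool prints to stdout. The `CONVERTERS` table and the `extract_text` name are hypothetical, and using `antiword` for .doc files is my assumption; swap in whatever tools you have installed.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping: file extension -> external command.
# "{path}" is replaced by the file name; each tool must print plain text to stdout.
CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],  # "-" makes pdftotext write to stdout
    ".doc": ["antiword", "{path}"],        # assumption: antiword is installed
}


def extract_text(path, converters=CONVERTERS, timeout=30):
    """Return the plain text an external converter prints for `path`.

    Returns an empty string when no converter is registered for the
    extension, the tool is missing, it times out, or it exits non-zero.
    """
    path = Path(path)
    template = converters.get(path.suffix.lower())
    if template is None:
        return ""

    # Build the argument list without a shell, so odd file names are safe.
    command = [arg.replace("{path}", str(path)) for arg in template]
    try:
        result = subprocess.run(
            command,
            capture_output=True,   # collect stdout/stderr instead of printing them
            encoding="utf-8",
            errors="replace",      # do not fail on stray bytes in the output
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""

    if result.returncode != 0:
        return ""
    return result.stdout
```

Calling `extract_text("manual.pdf")` would then give you the string to hand over to the search engine for indexing. Returning an empty string on failure means a broken or unsupported file is still indexed by its other attributes, just without file content.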

The tricky part is getting eZ Find to use the custom file handler, so that the binary file's content is parsed into something eZ Find/Solr can work with.

The link above contains a full-featured how-to plus downloads to get it working.
If you need any further help, feel free to reply to this post.

chris.

Hannover, Germany
eZ-Certified http://auth.ez.no/certification/verify/395613

Paul Borgermans

Wednesday 22 October 2008 2:21:15 pm

Something to point out here: indexing files like PDF, Word, ... depends on how eZ Publish is configured to convert them to plain text. It has nothing to do with the search plugin used (default, Solr/eZ Find, ...).

We'll improve the conversion mechanism options for the next iteration of eZ Publish (4.1). I'm also investigating a few more options to handle more file formats.

You'll learn more about that very soon (< 3 weeks)

Paul

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

Geoff Bentley

Wednesday 25 February 2009 3:21:12 pm

Check out Paul's ezTika extension ( http://projects.ez.no/eztika ) which draws on the Apache Tika toolkit ( http://lucene.apache.org/tika/ ) - this works seamlessly (so far) with eZ Find.
