Indexing content of files using Solr (ezfind)

Laurence Bonhomme

Wednesday 01 October 2008 1:36:23 am

Hi there,

I'm trying to index the content of files (txt, pdf) using eZFind + Solr and am having trouble with it.

eZPublish 4.0.1
eZFind 1.0.0 beta2
Linux Debian 4

First of all, I've installed the eZPublish and eZFind packages as recommended.
When I create a new media/file and upload a file (txt or pdf), indexing works perfectly and I can run searches (I can also find my words in the database table ezkeyword).

Because I found the raw search a bit "light", I decided to test with Solr.
... And now everything goes wrong.

I'm pretty sure it is installed correctly, because I can search for article content or file summaries in the admin Solr search.

But nothing about the content of the uploaded file itself.

What am I doing wrong?
Is there a trick?

Having a look at the Solr guide (http://wiki.apache.org/solr/), I found this:
"Solr has an extensible DocumentHandler architecture that allows you to feed it XML and CSV documents. There is now a patch file available as part of SOLR-284 that adds support for parsing rich binary formats."

Do we have to patch the provided Solr?

Would anyone be so kind as to help?

Thanks a lot
Laurence

Christian Rößler

Wednesday 22 October 2008 11:01:54 am

Laurence,

perhaps a bit late but better late than never :-)

I have had pretty much the same problems. Perhaps this link will help you: http://ez.no/developer/articles/indexing_multiple_binary_file_types

I was able to set up a generic binaryfilehandler that was called by ezfind for every physical file. This binaryfilehandler called several external programs (pdftotxt, doc2txt); the parsed contents of each file were printed to stdout and caught by the binaryfilehandler, then returned to ezfind, which saved them into the ezsolr index via an HTTP request.

The tricky point is to get ezfind to use the custom file handler to parse the binary file's content into something that ezfind/ezsolr can work with.
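
To illustrate the idea only (this is not the actual eZ Publish handler, which is written in PHP and described in the linked article), here is a minimal Python sketch of the same pattern: pick an external converter by file extension, run it, and capture whatever it prints to stdout. The names `CONVERTERS` and `extract_text` are hypothetical, and the use of `pdftotext <file> -` is an assumption; the tools mentioned in this post may be different ones.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping: file extension -> external command that writes plain
# text to stdout. "{path}" is replaced by the file being indexed.
# Assumption: pdftotext (poppler/xpdf) is installed; "-" makes it write to stdout.
CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],
}


def extract_text(path, converters=CONVERTERS, timeout=60):
    """Return the plain text an external converter prints for `path`.

    Returns an empty string if no converter is registered for the file
    extension, if the tool cannot be started, or if it exits with an error.
    """
    path = Path(path)
    command = converters.get(path.suffix.lower())
    if command is None:
        return ""  # unknown file type: nothing to hand to the indexer

    # Substitute the real file path into the command template.
    argv = [arg.replace("{path}", str(path)) for arg in command]
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""  # tool missing or hung: treat the file as having no text

    if result.returncode != 0:
        return ""
    return result.stdout  # this string is what would be passed on for indexing
```

Keeping the converter table separate from the function means a new file type only needs one more entry, which is roughly what a "generic" handler buys you.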

The link above contains a full-featured howto plus downloads to get it working.
If you need any further help, feel free to reply to this post.

chris.

Hannover, Germany
eZ-Certified http://auth.ez.no/certification/verify/395613

Paul Borgermans

Wednesday 22 October 2008 2:21:15 pm

Something to point out here: the configuration for indexing files like pdf, word, ... depends on the configuration of eZ publish to convert these to plain text. It has nothing to do with the search plugin used (default, Solr/eZ Find, ...).

We'll improve the conversion mechanism options for the next iteration of eZ Publish (4.1). I'm also investigating a few more options to handle more file formats.

You'll learn more about that very soon (< 3 weeks)

Paul

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

Geoff Bentley

Wednesday 25 February 2009 3:21:12 pm

Check out Paul's ezTika extension ( http://projects.ez.no/eztika ) which draws on the Apache Tika toolkit ( http://lucene.apache.org/tika/ ) - this works seamlessly (so far) with eZ Find.
