Searching Office Documents: ez's search vs. integration of "commercial" search engine

Marco Zinn

Saturday 04 October 2003 7:35:29 am

We are currently setting up an ez 3.2 system that focuses heavily on document management. We have lots of Office documents and PDFs, which will be "file" objects.
Using the new binary file indexing, we can index PDFs quite nicely with pdftotext (thanks to Paul).
We have some problems indexing Word (97) documents, as the output of the wvware converter is unusable at the moment. I don't know why this happens, but even a blank Word document with just a few test lines is not indexed (converted) correctly.

Furthermore, we need indexing of PowerPoint (97) and Excel (97) documents. We also have some zipped files that contain these kinds of Office documents and PDFs.
These issues are not yet solved by ez's search engine, so we are thinking about using a third-party solution for searching / indexing / crawling the content. This may be a commercial solution as well.

I'd like to know what experience other users (you?) have with this issue.
Can anyone recommend a (non-ez) search engine?
Or can anyone at least give us some hints on where to find tools for indexing PowerPoint files? Our server runs Solaris, btw.
As this is for an intranet site, we cannot use external indexing services; we need software that we can install ourselves.
Oh, one more thing: if we use an external search engine, it should also respect ez's permissions, as we have quite a lot of content that requires a login.

Marco Zinn

Marco
http://www.hyperroad-design.com

Paul Borgermans

Saturday 04 October 2003 8:25:48 am

Hi Marco

The search engine indexing is an issue that keeps ez publish from working well as a DMS or for any large web site. In particular, the ranking (relevance) is not up to the level actually required, and users DO rely on global search. The ez crew is certainly aware of this, but I don't know what the future will bring (apparently only a few people are asking for binary file indexing, for instance). I haven't played with openfts yet.

Below are my hints (I have not done this myself yet; PowerPoint is the most urgent for me).

---------powerpoint and excel--------
For powerpoint and excel files, you may try

http://chicago.sourceforge.net/xlhtml/

The PowerPoint conversion is included in the xlhtml archive.

You will also need lynx to do the HTML-to-text conversion, and you need to wrap everything in a shell script that the binary file handler calls. The same goes for zipped versions.
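
A minimal sketch of such a wrapper, written in Python rather than shell so the dispatch logic is easy to read. It assumes that the xlhtml archive installs the commands `xlhtml` and `ppthtml`, that both write HTML to standard output, and that `lynx -stdin -dump` turns that HTML into plain text. The function names are made up for illustration, and how the binary file handler calls the script (file path as first argument, text on standard output) is an assumption as well.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Assumed converter commands from the xlhtml archive; both are assumed
# to write HTML to standard output.
HTML_CONVERTERS = {
    ".xls": "xlhtml",
    ".ppt": "ppthtml",
}

# Assumed lynx invocation: read HTML from stdin, dump plain text,
# and leave out the list of links at the end.
LYNX_COMMAND = ["lynx", "-stdin", "-dump", "-nolist"]


def build_pipeline(path):
    """Return the two commands (to-HTML, to-text) for a file, or None."""
    converter = HTML_CONVERTERS.get(Path(path).suffix.lower())
    if converter is None:
        return None
    return [[converter, str(path)], list(LYNX_COMMAND)]


def extract_text(path):
    """Run the pipeline and return plain text ('' for unsupported files)."""
    pipeline = build_pipeline(path)
    if pipeline is None:
        return ""
    to_html, to_text = pipeline
    for command in (to_html, to_text):
        if shutil.which(command[0]) is None:
            raise RuntimeError(f"{command[0]} is not installed or not on PATH")
    html = subprocess.run(to_html, check=True, capture_output=True).stdout
    text = subprocess.run(to_text, input=html, check=True, capture_output=True).stdout
    return text.decode("utf-8", errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: the indexer passes the file path as the first argument
    # and reads the extracted text from standard output.
    sys.stdout.write(extract_text(sys.argv[1]))
```

Keeping the command table separate from the code that runs it means another format can be added with one extra line, provided its converter also writes HTML to standard output.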

--------msword--------
I'm surprised wvware does not work for you, since I got it running nicely. Does it work on the command line? I first had to make sure the right XML config was actually there (on SuSE Linux). Do you have the most recent version?

--------zipped files--------
Add a CPU or two and wrap unzip and gunzip in a shell script.
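
A rough sketch of the unzip step, using Python's standard `zipfile` module instead of the `unzip` command. It only covers .zip archives, not gzip files. The `convert` argument is a hypothetical callable that takes a file path and returns the extracted text (or an empty string for formats it cannot handle), for example a function that shells out to the converters mentioned above.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip_text(zip_path, convert):
    """Unpack an archive into a temporary directory and convert each member.

    `convert` is a hypothetical callable: file path in, extracted text out
    ('' for unsupported formats).
    """
    parts = []
    with tempfile.TemporaryDirectory() as workdir:
        with zipfile.ZipFile(zip_path) as archive:
            for member in archive.infolist():
                if member.is_dir():
                    continue
                # ZipFile.extract sanitises member names, so entries
                # cannot be written outside the temporary directory.
                extracted = archive.extract(member, path=workdir)
                text = convert(extracted)
                if text:
                    parts.append(text)
    # The temporary directory is removed here; only the text is kept.
    return "\n".join(parts)
```

The text of all members is joined into one string, so the archive is indexed as a single "file" object. Whether that is the desired granularity depends on how search hits should be presented.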

---openoffice-----
There are some xslt filters included which should work after unzipping.
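
As a simpler stand-in for the XSLT filters (this is not the XSLT route itself), the following sketch reads the text directly from an unzipped document. It assumes the OpenOffice file is a zip archive whose body text lives in `content.xml`, with paragraphs and headings stored as `text:p` and `text:h` elements. The function name is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET


def openoffice_text(path):
    """Return the paragraph and heading text found in content.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    paragraphs = []
    for element in root.iter():
        # Match on the local element name so the sketch does not depend
        # on the exact namespace URI of the document version.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in ("p", "h"):
            text = "".join(element.itertext()).strip()
            if text:
                paragraphs.append(text)
    return "\n".join(paragraphs)
```

Paragraphs nested inside other paragraphs (for example in text frames) would be collected twice. For a search index that is usually harmless, but it may skew word counts.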

---wordperfect---
See the openoffice filter; it's standalone!

I hope openoffice will provide more command-line options, so we can use it as a vehicle for all kinds of office formats.

Have a nice weekend

-paul

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans
