Solr Indexing Error

Sylvain Gogel

Tuesday 07 April 2009 5:04:14 am

Hi there. While running eZ Find 2 indexing, I noticed that some data is not indexed >_<

After some digging, I found that Solr::addDocs() has some serious issues:

    function addDocs( $docs = array(), $commit = true, $optimize = false )
    {
        //
        if ( !is_array( $docs ) )
        {
            echo( "docs is not an array\n" );
            return false;
        }
        if ( count( $docs ) == 0 )
        {
            echo( "docs is empty\n" );
            return false;
        }
        else
        {
            $postString = '<add>';
            foreach ( $docs as $doc )
            {
                $postString .= $doc->docToXML();
            }
            $postString .= '</add>';

            //echo($postString."\n");

            $updateResult = $this->postQuery( '/update', $postString, 'text/xml' );
            echo $updateResult;

This last echo outputs some Java errors:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 </title>
</head>
<body><h2>HTTP ERROR: 500</h2><pre>ParseError at [row,col]:[25,1]
Message: An invalid XML character (Unicode: 0xc) was found in the element content of the document.

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[25,1]
Message: An invalid XML character (Unicode: 0xc) was found in the element content of the document.
  at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:588)
  at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:321)
  at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
  at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
  at org.mortbay.jetty.Server.handle(Server.java:285)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
  at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
  at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
</pre>
<p>RequestURI=/solr/update</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>

Obviously the generated XML is not parsable, so the resulting content is not indexed!
The content objects contain binary PDF files and images.
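
For reference, Unicode 0xc is the form feed control character, which XML 1.0 does not allow in element content. A rough way to confirm that extracted text is the culprit is to count the forbidden control bytes in it. This is only a sketch: count_xml_invalid is a made-up helper name, not part of eZ Find, and it reads the text from standard input.

```shell
#!/bin/sh
# Hypothetical helper (not part of eZ Find): count the bytes on stdin that
# XML 1.0 forbids, i.e. control characters below 0x20 other than
# tab (\011), newline (\012) and carriage return (\015).
count_xml_invalid() {
    # tr -cd deletes everything EXCEPT the listed bytes; wc -c counts what is left.
    # The final tr strips the padding some wc implementations add.
    tr -cd '\000-\010\013\014\016-\037' | wc -c | tr -d ' '
}
```

Piping the output of the text extraction tool into count_xml_invalid should print 0 for clean text. Any other number would suggest that the indexed text contains characters the Solr XML parser rejects.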

Does anyone have a fix for eZ Find stable?

--
http://www.ecedi.fr
Agence Web, Créa/Conseils, Accessibilité
eZPublish, Drupal, Zend, Symfony

Sylvain Gogel

Tuesday 07 April 2009 5:12:41 am

I use both

[PDFHandlerSettings]
TextExtractionTool=pstotext

and

[PDFHandlerSettings]
TextExtractionTool=mypdftotext

The latter is a shell script based on the xpdf tool pdftotext:

#!/bin/sh
/usr/bin/pdftotext -enc "UTF-8" "$1" -
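
A possible variant of this wrapper, as a sketch only: pdftotext separates pages with form feed characters (0x0C), and XML 1.0 forbids that character, so the script below turns form feeds into newlines and drops the other forbidden control characters before the text reaches the indexer. The function name strip_xml_invalid is made up for illustration, and the /usr/bin/pdftotext path is taken from the script above.

```shell
#!/bin/sh
# Hypothetical helper: make text on stdin safe for an XML 1.0 document.
strip_xml_invalid() {
    # Turn form feeds (page breaks) into newlines so that words on either
    # side of a page boundary stay separate, then delete every remaining
    # control character below 0x20 except tab, newline and carriage return.
    tr '\014' '\n' | tr -d '\000-\010\013\016-\037'
}

# When called with a PDF file as first argument, extract and filter its text.
if [ -n "${1:-}" ]; then
    /usr/bin/pdftotext -enc "UTF-8" "$1" - | strip_xml_invalid
fi
```

Depending on the installed version, pdftotext may also accept a -nopgbrk option that suppresses the page-break form feeds; the filter above would still catch other stray control characters.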

Geoff Bentley

Wednesday 08 April 2009 10:15:04 pm

Another option is to use the eZ Tika extension, which allows indexing of a large variety of binary file types such as MS Word, Excel and other MS Office formats, PDF, and ODF:

* http://projects.ez.no/eztika
* http://lucene.apache.org/tika/

Christian Rößler

Friday 08 May 2009 4:40:23 pm

It seems that a UTF-8 double-byte character is being interpreted as two 8-bit characters, which is not correct.

I think I have the same issue here (German umlauts and the like) and found a promising page that explains the Tomcat/Solr charset settings:

http://wiki.apache.org/solr/SolrTomcat#head-20147ee4d9dd5ca83ed264898280ab60457847c4

Perhaps this is the issue: XML cannot contain those characters, so the Solr XML parser fails. Can you try it out? I currently do not have access to an eZ installation.
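
One way to test this hypothesis before changing any Tomcat settings is to check whether the extracted text is valid UTF-8 at all. A minimal sketch, assuming the iconv utility is installed; is_valid_utf8 is a made-up helper name that reads from standard input. Note that it only detects broken byte sequences: a control character such as 0xc is perfectly valid UTF-8 and would pass this check while still being rejected by the XML parser.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) only if stdin is valid UTF-8.
is_valid_utf8() {
    # Converting UTF-8 to UTF-8 is a no-op for valid input; iconv exits
    # with a non-zero status when it meets an invalid byte sequence.
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
```

If the text passes this check but indexing still fails, the encoding settings are probably not the cause.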

cheers,
christian

Hannover, Germany
eZ-Certified http://auth.ez.no/certification/verify/395613
