Solr Indexing Error

Sylvain Gogel

Tuesday 07 April 2009 5:04:14 am

Hi there. While running the eZ Find 2 indexing, I noticed that some data is not indexed >_<

After some digging, I found that Solr::addDocs() has some serious issues:

    function addDocs( $docs = array(), $commit = true, $optimize = false )
    {
        if ( !is_array( $docs ) )
        {
            echo( "docs is not an array\n" );
            return false;
        }
        if ( count( $docs ) == 0 )
        {
            echo( "docs is empty\n" );
            return false;
        }
        else
        {
            $postString = '<add>';
            foreach ( $docs as $doc )
            {
                $postString .= $doc->docToXML();
            }
            $postString .= '</add>';

            //echo( $postString . "\n" );

            $updateResult = $this->postQuery( '/update', $postString, 'text/xml' );
            echo $updateResult;

This last echo outputs a Java error:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 </title>
</head>
<body><h2>HTTP ERROR: 500</h2><pre>ParseError at [row,col]:[25,1]
Message: An invalid XML character (Unicode: 0xc) was found in the element content of the document.

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[25,1]
Message: An invalid XML character (Unicode: 0xc) was found in the element content of the document.
  at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:588)
  at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:321)
  at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
  at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
  at org.mortbay.jetty.Server.handle(Server.java:285)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
  at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
  at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
</pre>
<p>RequestURI=/solr/update</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>

Obviously the generated XML cannot be parsed, so the resulting content is not indexed!
The content objects contain binary PDF files and images.

Does anyone have a fix for the stable release of eZ Find?

--
http://www.ecedi.fr
Agence Web, Créa/Conseils, Accessibilité
eZPublish, Drupal, Zend, Symfony
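
The error above names character 0xc, the form feed, which XML 1.0 does not allow in element content. PDF text extraction tools often emit a form feed between pages, so one possible workaround is to clean the extracted text before it goes into the document XML. The following is a minimal sketch under that assumption: stripInvalidXmlChars() is a hypothetical helper, not part of eZ Find or Solr, and where it would be called inside eZ Find is left open.

```php
<?php
/**
 * Replace characters that XML 1.0 does not allow in element content.
 * Hypothetical helper for illustration; expects UTF-8 input.
 */
function stripInvalidXmlChars( $text )
{
    // XML 1.0 permits only #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD
    // and #x10000-#x10FFFF. Anything else, such as the form feed 0x0C,
    // is replaced by a space so that neighbouring words stay separated.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        ' ',
        $text
    );

    // preg_replace() returns null when the subject is not valid UTF-8;
    // returning an empty string is a conservative choice for this sketch.
    return $clean === null ? '' : $clean;
}

// Usage: a form feed between two pages of extracted text.
echo stripInvalidXmlChars( "Page one\x0CPage two" ), "\n"; // prints "Page one Page two"
```

Tabs, line feeds and carriage returns pass through unchanged, because XML 1.0 allows them.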

Sylvain Gogel

Tuesday 07 April 2009 5:12:41 am

I have tried both of these settings:

[PDFHandlerSettings]
TextExtractionTool=pstotext

and

[PDFHandlerSettings]
TextExtractionTool=mypdftotext

The latter is a shell script based on the xpdf tool pdftotext:

#!/bin/sh
/usr/bin/pdftotext -enc "UTF-8" "$1" -


Geoff Bentley

Wednesday 08 April 2009 10:15:04 pm

Another option is to use the eZ Tika extension, which allows indexing of a large variety of binary file types such as MS Word and other MS Office formats, PDF, Excel and ODF:

* http://projects.ez.no/eztika
* http://lucene.apache.org/tika/

Christian Rößler

Friday 08 May 2009 4:40:23 pm

It seems that a UTF-8 double-byte character is being interpreted as two 8-bit characters, which is not correct.

I think I have the same issue here (German umlauts and the like), and I found a promising page that explains the Tomcat/Solr charset settings:

http://wiki.apache.org/solr/SolrTomcat#head-20147ee4d9dd5ca83ed264898280ab60457847c4

Perhaps this is the issue: XML cannot contain those characters, so the Solr XML parser crashes. Can you try it out? I currently do not have access to an eZ installation.

cheers,
christian

Hannover, Germany
eZ-Certified http://auth.ez.no/certification/verify/395613
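
The thread offers two possible explanations for the parse error: an encoding mismatch, or a stray control character such as the 0xc named in the error message. One way to tell them apart is to inspect the extracted text before it is posted to Solr. The sketch below assumes the text is available as a PHP string; diagnoseText() is a hypothetical helper, not part of eZ Find or Solr.

```php
<?php
/**
 * Report properties of a text that could make it unparsable as XML.
 * Hypothetical diagnostic helper for illustration only.
 */
function diagnoseText( $text )
{
    // An empty pattern with the /u modifier matches only if the
    // subject is valid UTF-8; otherwise preg_match() returns false.
    $validUtf8 = preg_match( '//u', $text ) === 1;

    // Control characters other than tab, LF and CR are not allowed in XML 1.0.
    // Bytes below 0x20 never occur inside a UTF-8 multi-byte sequence,
    // so a byte-wise search is safe here.
    $controlChars = array();
    if ( preg_match_all( '/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $text, $matches ) )
    {
        foreach ( $matches[0] as $char )
        {
            $controlChars[] = sprintf( '0x%02X', ord( $char ) );
        }
    }

    return array( 'validUtf8' => $validUtf8, 'controlChars' => $controlChars );
}

// Valid UTF-8 umlauts followed by a form feed.
$report = diagnoseText( "Gr\xC3\xBC\xC3\x9Fe\x0C" );
echo $report['validUtf8'] ? 'valid UTF-8' : 'not valid UTF-8', "\n";
echo implode( ', ', $report['controlChars'] ), "\n";
```

For this sample the text is reported as valid UTF-8 with one control character, 0x0C. The same word encoded as ISO-8859-1 ("Gr\xFC\xDFe") would be reported as not valid UTF-8 with no control characters, which would point at a charset setting instead.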
