Accented characters are not working in Solr search

Praveen Kumar

Tuesday 16 August 2011 5:00:23 pm

Hi, 
This is Praveen. I am using Apache Solr in our project to support search on cities. I am having a problem with accented characters while searching.
For example:
My city name is 'vrély'.
If I search for 'vr*', it returns the result.
But if I search for 'vrél*', it does not return any results.
But if I search without accented characters, like 'vre*', it returns results again.
My city field type is "text", and my schema.xml definition for it is as follows:
        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                
                
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
            </analyzer>
        </fieldType>
Any suggestions or solutions to resolve my problem would be appreciated.
Thanks in advance.
Regards, 
Praveen Kumar 

Ivo Lukac

Wednesday 17 August 2011 12:59:15 am

There could be two things:

- either your index and query analyzers are not the same (e.g. there is a small difference: the query analyzer has catenateWords="0" catenateNumbers="0"), so the tokens are not the same in both situations, or

- the "é" character is somehow badly encoded when it is sent to Solr as a query.

I had a similar problem before when I used Jetty; it didn't support UTF-8 queries very well, so I switched to Tomcat. It could be that Jetty has resolved those issues in a newer version; I didn't check.
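
To check the encoding theory, it can help to look at what the query looks like on the wire. The following is only a sketch in Python (not something from the poster's setup): it percent-encodes the prefix query 'vrél*' as UTF-8 and as Latin-1, then shows what a server would read if it decoded the bytes with the other charset.

```python
from urllib.parse import quote, unquote

query = "vr\u00e9l*"  # the prefix query 'vrél*' from the question

# Percent-encode the query the way a client would before putting it in a URL.
utf8_encoded = quote(query, encoding="utf-8")      # 'é' becomes two bytes: C3 A9
latin1_encoded = quote(query, encoding="latin-1")  # 'é' becomes one byte: E9

# If the server decodes with a different charset than the client used,
# the term it sees no longer contains 'é'.
utf8_read_as_latin1 = unquote(utf8_encoded, encoding="latin-1")
latin1_read_as_utf8 = unquote(latin1_encoded, encoding="utf-8", errors="replace")

print(utf8_encoded)
print(latin1_encoded)
print(repr(utf8_read_as_latin1))
print(repr(latin1_read_as_utf8))
```

If the request URL in the servlet container's access log shows %E9 rather than %C3%A9, or the term echoed back by Solr looks like the mangled values printed above, a charset mismatch between client and container is a likely suspect.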

Anyway, you need to be aware that "vrély" is always indexed as "vrely" (the ASCII folding filter strips the accent), which is why you find it with vr* and vre*.
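
The effect can be illustrated with a small Python sketch. It is not Solr code: `fold` and `prefix_matches` are hypothetical helpers, Unicode decomposition is used as a rough stand-in for LowerCaseFilterFactory plus ASCIIFoldingFilterFactory, and stemming and the other filters are ignored. It also assumes that the wildcard term is compared against the index without going through the query analyzer, which would explain the behavior described in the question but should be verified for your Solr version.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase and strip accents (approximation of lowercasing + ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def prefix_matches(indexed_token: str, prefix: str) -> bool:
    """Compare a prefix, exactly as given, against an already-folded indexed token."""
    return indexed_token.startswith(prefix)


indexed = fold("vr\u00e9ly")  # what this sketch stores in the index for 'vrély'

for raw_prefix in ("vr", "vre", "vr\u00e9l"):
    raw_hit = prefix_matches(indexed, raw_prefix)           # prefix sent unanalyzed
    folded_hit = prefix_matches(indexed, fold(raw_prefix))  # prefix folded first
    print(raw_prefix, raw_hit, folded_hit)
```

Under these assumptions, the accented prefix only matches once it has been folded the same way as the indexed token, so one possible workaround is to lowercase and strip accents from wildcard terms in the client before sending them.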

http://www.linkedin.com/in/ivolukac
http://www.netgen.hr/eng/blog
http://twitter.com/ilukac

Philippe VINCENT-ROYOL

Wednesday 17 August 2011 1:24:20 am

Just a question: which version of Solr are you using?

Certified Developer (4.1): http://auth.ez.no/certification/verify/272607
Certified Developer (4.4): http://auth.ez.no/certification/verify/377321

G+ : http://plus.tl/dspe
Twitter : http://twitter.com/dspe
