eZ Find / Solr config for partial search on latin chars

Bjørnar Grøtterud

Friday 13 May 2011 8:16:29 am

Hi,

We have an install with eZ Find 2.3 configured with ReversedWildcardFilterFactory. It works well for partial wildcard searches, but we still have issues when searching for words that start with a Latin character such as Ø, when the query is two, three or four characters long.

An example:

We have indexed an object named Ølbolle

Query - result:

øl - no result

ølb - no result

ølbo - no result

ølbol - no result

ølboll - ølbolle

ølbolle - ølbolle

When searching for words starting with non-Latin characters, this is not an issue.

Our index analyzer is set up as follows:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="ISOLatin1AccentFilterFactory"/>
  <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
</analyzer>

Any experience / suggestions greatly appreciated.

Paul Borgermans

Monday 16 May 2011 11:58:44 am

Hello Bjørnar

I guess you mean that you are using normal wildcards (asterisks) after the characters, added in your template logic.

The culprit is actually the ISOLatin1 accent filter factory: when doing wildcard searches, the analysis steps are not applied to the query. There is still no stable resolution for this in Solr (where the "problem" actually lies). The same goes for lowercasing, by the way.

If you really want to rely so much on wildcards (which I don't actually recommend either), it is best to remove the ISOLatin1 "normalisation" as well (in both the index and query analysis steps).
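
A minimal sketch of what that could look like on the index side, assuming the analyzer posted above with only the accent filter line dropped. The query analyzer would need the same filter removed, and the content has to be reindexed before the change shows up in search results.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- ISOLatin1AccentFilterFactory removed: indexed terms keep ø, æ, å as typed -->
  <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
</analyzer>
```

Since lowercasing is likewise skipped for wildcard terms, the query string would still have to be lowercased before the wildcard is appended, for example in the template logic.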

Furthermore, I think you did not remove the stemming step for the query part of your text field type in schema.xml (otherwise you should not get a match for ølboll). You should absolutely remove the stemming part there as well.
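
As a rough illustration only, a query analyzer without stemming might look like the fragment below. The filter list is an assumption, not taken from your schema.xml, so treat it as a sketch of the shape rather than a drop-in replacement.

```xml
<!-- Hypothetical query analyzer; the real one in schema.xml may differ -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- no stemming filter (such as solr.SnowballPorterFilterFactory) in this chain -->
  <!-- no ISOLatin1AccentFilterFactory either, to mirror the index analyzer -->
</analyzer>
```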

It would be good if you could email me your schema.xml for closer inspection.

Cheers

Paul

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

Bjørnar Grøtterud

Thursday 19 May 2011 12:42:11 am

Hi Paul,

thanks for your reply.

The reason we added wildcards is that we found this to be the only solution for partial search on both the first and last part of a word. It actually works well, except for Latin chars.

I'll try your suggestion of removing the ISOLatin1AccentFilterFactory.

We have already removed the SnowballPorterFilterFactory from the query part in our schema.xml.

I'll send you our schema.xml, thanks for taking a look!

Best regards

Bjørnar
