Problem with really BIG solr indices


Xavier Serna

Thursday 19 March 2009 4:52:13 am

Hi all,

let me try to explain a problem we have run into with eZFind 2.0.0 (and with previous versions).

Background: we have indexed about 30k eZContentObjects in the Solr engine, plus 226k external XML files. This generates about 5 GB of index data.

We currently start the Solr engine with this command:

/usr/bin/java -server -Xmx600m -Xms600m -XX:+UseParallelGC -XX:+AggressiveOpts -XX:NewRatio=5 -jar start.jar

Our main problem is that every time a commit is executed in the Solr engine, all index files (remember, 5 GB of data) are regenerated from scratch. The data folder grows to about 10 GB, then the old files are removed and only the new ones remain.
As we have delayed indexing enabled, this is not a critical problem when publishing content, but it is when deleting something: every time we remove an object, the system freezes until the indices are regenerated.

Does anyone out there have a similar scenario and can guide us?

Thanks for reading!
Xavier

--
Xavier Serna
eZ Publish Certified Developer
Departament de Software
Microblau S.L. - http://www.microblau.net
+34 937 466 205

Ali Nebi

Thursday 09 April 2009 6:50:18 am

Hi,

we just ran some tests with ezfind2 and found the same problem. The Solr indexes took 650 GB, which is really big. The same content indexed with ezfind1 and its related Solr takes 9.5 GB.

Why does this happen, and how can we solve it?

Thanks in advance!

Iguana Information Technologies, SL - http://www.iguanait.com

Nicolas Pastorino

Friday 10 April 2009 12:32:55 am

@Xavier :
Any feedback on your issue? Did the proposed solution of disabling the OptimizeOnCommit directive and setting up a daily 'optimize' workflow work?
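For readers who want to try the daily optimize, here is a minimal sketch of one way to trigger it from outside eZ Publish. It assumes Solr listens on the default http://localhost:8983/solr and accepts XML update messages at /update; the function names are made up for this example and are not part of eZ Find.

```python
import urllib.request

# Assumed Solr update URL; adjust host, port and path to your installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_optimize_request(update_url=SOLR_UPDATE_URL):
    """Build the HTTP POST that asks Solr to optimize (merge) its index."""
    return urllib.request.Request(
        update_url,
        data=b"<optimize/>",  # Solr XML update message for an optimize
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def run_optimize(update_url=SOLR_UPDATE_URL, timeout=3600):
    """Send the optimize command and return the HTTP status code.

    The timeout is generous because optimizing a multi-GB index can take long.
    """
    request = build_optimize_request(update_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling run_optimize() once a night from a cron job, at a quiet hour, would keep the expensive index rewrite away from normal publishing and deleting.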

@Ali :
This index size is very surprising. Did the indexed content base grow a lot between the eZ Find 1.x usage and eZ Find 2.0? Are external elements indexed too (through the DataImportHandler Solr extension, for instance)? Are websites crawled?

Best regards,

--
Nicolas Pastorino
Director Community - eZ
Member of the Community Project Board

eZ Publish Community on twitter: http://twitter.com/ezcommunity

t : http://twitter.com/jeanvoye
G+ : http://plus.tl/jeanvoye

Ali Nebi

Wednesday 15 April 2009 8:20:04 am

Sorry for my late reply.

We use the same database for the tests, and the data in it does not change. We also don't index any external elements.

We are continuing to test this. On another test server the data directory was smaller than on the first server, where it was 650 GB, but it is still big: 14 GB with 40% of the data indexed.

Regards, Ali Nebi!

Iguana Information Technologies, SL - http://www.iguanait.com

Xavier Serna

Thursday 16 April 2009 12:50:04 am

Hi Nico,

many thanks for your proposed solution; it seems to work fine now that OptimizeOnCommit is disabled.
Only one detail: updatesearchindexsolr.php forces an optimize on each commit, every 1000 objects, ignoring the setting in the ini file. I believe this should be changed, because reindexing all the XML files takes more than 4 hours.
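To make the request concrete, here is a toy sketch of the batching behaviour I would expect. It is not the actual code of updatesearchindexsolr.php; the function and parameter names are invented for illustration, and it only lists the operations that would be sent to Solr.

```python
def plan_index_run(total_objects, batch_size=1000, optimize_on_commit=False):
    """Return the ordered list of Solr operations for a full reindex.

    Objects are added in batches; every batch is followed by a commit.
    An optimize is appended only when optimize_on_commit is True, which
    stands in for the ini setting.
    """
    operations = []
    for start in range(0, total_objects, batch_size):
        count = min(batch_size, total_objects - start)
        operations.append(("add", count))
        operations.append(("commit", None))
        if optimize_on_commit:
            operations.append(("optimize", None))
    return operations
```

With 226,000 files in batches of 1000, plan_index_run(226000, optimize_on_commit=True) contains 226 optimizes, each rewriting the whole index, while the default contains none and leaves optimizing to the daily job.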

thanks!

--
Xavier Serna
eZ Publish Certified Developer
Departament de Software
Microblau S.L. - http://www.microblau.net
+34 937 466 205

Ali Nebi

Monday 01 June 2009 4:56:57 am

Hi,

after some more tests and more time spent on ezfind 2, we found out why the Solr indices were so big.

First we needed to set userFork to false. The real problem was explained here by Denitsa M.:

http://ez.no/developer/forum/extensions/ez_find/ezfind2_indexing_speed_incredibly_low_er

When the indexer reaches objects that have relationlist attribute(s), indexing loops between these objects and the indices get bigger and bigger. Once we made these attributes non-searchable, indexing a 2 GB database was much faster and the index size was in the hundreds of MB.

Regards, Ali Nebi!

Iguana Information Technologies, SL - http://www.iguanait.com
