Wednesday 05 March 2008 4:40:35 pm
OK, well, I got the old PDF indexing working using pdftotext ... much the same mechanism as good old htdig. Still, eZ Find seems to be the way to go, even though it means adding a JRE to the server to run it. We'll be indexing quite large PDF files (a historical newspaper collection). I have yet to run a trial, but I'm wondering how eZ Find works at the database end, and whether potentially very large chunks of data will cause problems, as seemed to happen with the old system?
Also, does anyone know if eZ Find uses some kind of stop list (a list of common function words with little semantic content, like "the", "a", "and", "it", etc.)? At a glance, the older system doesn't seem to filter these out. These words are among the most frequently used, so filtering them out can significantly reduce the amount of data that needs processing without having much impact on the effectiveness of a search.
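
To show what I mean by a stop list, here is a rough sketch in Python. It is only an illustration of the idea, not how eZ Find or the old system actually does it; the word list, the function names and the sample sentence are all made up for the example.

```python
import re

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "and", "it", "of", "to", "in"})


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text, stop_words=STOP_WORDS):
    """Return the tokens that would be indexed, with stop words dropped."""
    return [term for term in tokenize(text) if term not in stop_words]


# Made-up sample text standing in for text extracted from a PDF.
sample = "The mayor opened the bridge and it was a fine day in the town"
all_terms = tokenize(sample)
kept = index_terms(sample)

print(len(all_terms), "tokens before filtering")
print(len(kept), "tokens after filtering:", kept)
```

Even on one short sentence a fair share of the tokens disappears, which is why I'd expect it to matter for a large newspaper collection.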