Lucene in Action

Upgrading example code for Lucene 2.0

To bring the LIA code examples up to date with the Lucene 2.x API, only a few minor changes are necessary. Replace all BooleanQuery.add calls, e.g.:

    -      subjectQuery.add(tq, false, false); 
    +      subjectQuery.add(tq, BooleanClause.Occur.SHOULD);
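
For reference, the old two-boolean form maps onto the new clause types roughly as sketched below. This is only an illustration: the Occur enum and the fromFlags helper are hypothetical stand-ins written for this note, not Lucene classes (the real constants are BooleanClause.Occur.MUST, SHOULD and MUST_NOT).

```java
public class OccurMapping {
    // Hypothetical stand-in for Lucene's BooleanClause.Occur constants.
    public enum Occur { MUST, SHOULD, MUST_NOT }

    // Translates the flags of the old add(query, required, prohibited) call.
    public static Occur fromFlags(boolean required, boolean prohibited) {
        if (required && prohibited) {
            // A clause cannot be both required and prohibited.
            throw new IllegalArgumentException("required and prohibited are mutually exclusive");
        }
        if (required) {
            return Occur.MUST;
        }
        if (prohibited) {
            return Occur.MUST_NOT;
        }
        return Occur.SHOULD;
    }

    public static void main(String[] args) {
        System.out.println(fromFlags(false, false)); // SHOULD
        System.out.println(fromFlags(true, false));  // MUST
        System.out.println(fromFlags(false, true));  // MUST_NOT
    }
}
```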

Substitute RangeFilter for DateFilter usage, e.g.:

    -    DateFilter filter = new DateFilter("modified", jan1, dec31); 
    +    RangeFilter filter = new RangeFilter("modified", jan1, dec31, true, true);

NOTE: The dates are now Strings generated by DateTools.dateToString() and are incompatible with DateField.
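
A range over string terms is compared lexicographically, so the date strings must sort in chronological order. The sketch below uses only the JDK. It assumes that a day-resolution string from Lucene's DateTools class looks like yyyyMMdd in GMT; the dayKey and gmtDate helpers are hypothetical names used for illustration, not Lucene methods.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class DateKeys {
    // Formats a date as yyyyMMdd in GMT (assumed to mirror a
    // day-resolution DateTools string; not a Lucene call).
    public static String dayKey(Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(date);
    }

    // Builds midnight GMT for the given calendar day (month is 1-based).
    public static Date gmtDate(int year, int month, int day) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("GMT"));
        cal.clear();
        cal.set(year, month - 1, day);
        return cal.getTime();
    }

    public static void main(String[] args) {
        String jan1 = dayKey(gmtDate(2004, 1, 1));
        String dec31 = dayKey(gmtDate(2004, 12, 31));
        System.out.println(jan1 + " .. " + dec31);
        // String order matches date order, which a term range relies on.
        System.out.println(jan1.compareTo(dec31) < 0);
    }
}
```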

Replace all Field.Keyword/UnStored/Text/UnIndexed with the enumerated types, e.g.:

    -      doc.add(Field.Keyword("animal", animal)); 
    +      doc.add(new Field("animal", animal, Field.Store.YES, Field.Index.UN_TOKENIZED));

Rename PhrasePrefixQuery -> MultiPhraseQuery

Use instance of QueryParser instead of static parse method, e.g.:

    -        Query query = QueryParser.parse(expression, "contents", analyzer);
    +        Query query = new QueryParser("contents", analyzer).parse(expression);

QueryParser subclasses adjusted for overridden getXXXQuery method signatures.

IndexReader.delete() updated to be IndexReader.deleteDocument/.deleteDocuments()

IndexWriter internal configuration values now accessed through getters/setters rather than the fields directly, with minMergeDocs renamed as setMaxBufferedDocs().

QueryParser.setLowercaseWildcardTerms() replaced with .setLowercaseExpandedTerms()

QueryParser.getRangeQuery() still uses DateField when constructing a RangeQuery. If your index is built using DateTools, you will need to subclass and override, as shown in QueryParserTest.testRangeQuery().

Posted on Tue, 22 May 2007 13:22

NPE in PDFBoxPDFHandler

The book has a coding error in PDFBoxPDFHandler.java on page 237, between the notes (5) and (6).

The problem is that PDFBoxPDFHandler.java sets pdDoc to null and then immediately uses it. The source code download has been corrected with this fix on line 86:

    pdDoc = new PDDocument(cosDoc);

Thanks to Bill Gibson for reminding us to add this to the official errata list.

Posted on Mon, 20 Mar 2006 13:23

Memory leak in custom sort code

Brian Riddle e-mailed us quite a detailed errata item, and with his permission I'm posting the e-mail in its entirety in order to preserve the details:

Hello,
First, *huge* thanks for your book Lucene in Action. Between it and the Lucene developers and user mailing lists, I have been able to give our site a much better search infrastructure.

In the last phase of rolling out our new search system we discovered a memory leak in listing 6.2, DistanceComparatorSource. I used that code as a base for a modified integer sort. That was in and of itself pretty straightforward. But the problem was that there were no equals and hashCode methods. That means that equals and hashCode are inherited from Object for DistanceScoreDocLookupComparator.

And therein lies the memory leak. Every time a new DistanceComparatorSource was retrieved, it failed to find the cached ScoreDocComparator, so it added it to the cache of ScoreDocComparators kept by o.a.l.s.FieldCacheImpl. The fix was to add hashCode and equals methods to our ScoreDocComparator implementation.

The big clue came after using www.yourkit.com's profiler to see what was allocating so much memory and reading the last paragraph on page 199 a couple of times.

"The sorting infrastructure within Lucene caches (based on a key combining the hashcode of the indexReader, the field name, and the custom sort object) ..."

That sentence gave the clue as to what was happening, but it is also a little misleading. Looking at o.a.l.s.FieldCacheImpl, the index reader is used as the key for the internal WeakHashMap of the different Entry(fieldName, ScoreDocComparator) objects that are used in an application.

If implementations of ScoreDocComparator do not implement hashCode and equals correctly, they will be added to the internal cache of field/comparators every time they are used.

This was completely my fault, as I usually add them to every class I write, but not in this case. I hope you can add this to the errata for the current edition (as well as fix the code) and expand on this in the second edition so others won't be bitten by this bug.

Thanks again for the book you guys *rock*.

PS. We are using lucene-1.4.3.jar /jsdk 1.4.2 & jre 1.5 solaris and linux
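
The effect Brian describes can be reproduced with a plain HashMap. The sketch below is an illustration only: DistanceSource is a hypothetical stand-in for a custom sort class, not the book's listing, and the map merely imitates a cache keyed on the sort object. It uses generics and @Override, so it needs a newer JDK than the 1.4.2 mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

public class ComparatorCacheDemo {
    // Hypothetical stand-in for a custom sort source used as a cache key.
    public static final class DistanceSource {
        private final int x;
        private final int y;

        public DistanceSource(int x, int y) {
            this.x = x;
            this.y = y;
        }

        // Two sources with the same origin are interchangeable as keys.
        @Override
        public boolean equals(Object other) {
            if (this == other) {
                return true;
            }
            if (!(other instanceof DistanceSource)) {
                return false;
            }
            DistanceSource that = (DistanceSource) other;
            return x == that.x && y == that.y;
        }

        @Override
        public int hashCode() {
            return 31 * x + y;
        }
    }

    // Simulates one search per iteration: a fresh source is created each
    // time and looked up in the cache before anything new is stored.
    public static int cacheSizeAfter(int searches) {
        Map<DistanceSource, String> cache = new HashMap<DistanceSource, String>();
        for (int i = 0; i < searches; i++) {
            DistanceSource key = new DistanceSource(10, 10);
            if (!cache.containsKey(key)) {
                cache.put(key, "comparator for (10, 10)");
            }
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(cacheSizeAfter(1000)); // 1
    }
}
```

Without the equals and hashCode overrides, every new DistanceSource would be a distinct key and the map would grow by one entry per search.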

Posted on Wed, 1 Mar 2006 04:03

Incorrect figure reference

On page 406 (under Term Positions) the second paragraph refers to figure B.4. The reference should be to figure B.3. There is no figure B.4. Thanks to Ira Goldstein for reporting this issue.

Posted on Sun, 20 Nov 2005 09:49

Incorrect figure reference

On page 30, where it says "...you'll notice that figure 2.1 and figure 7.3 resemble each other", the reference to figure 7.3 is incorrect; it should refer to the figure on page 256, which is figure 7.1. Thanks to Ira Goldstein, via the Manning Author Online forum, for reporting this issue.

Posted on Sun, 20 Nov 2005 09:46

Two Pseudocode fixes

Page 51 contains pseudocode with two mistakes. Instead of

    fsWriter.addIndexes( Directory[] {ramDir} );

it should be:

    fsWriter.addIndexes( new Directory[] {ramDir} );

Also, there is a line that contains:

    ramWriter.close();

This call should be moved up one line, so that it comes before the addIndexes call mentioned above. The close() call needs to execute first so that newly added documents are flushed before the index merge. Otherwise, not all documents added to ramWriter will be added to fsWriter.

Posted on Mon, 19 Sep 2005 22:04

CachingWrapperingFilter

Our keen Korean translator, Cheolgoo Kang, pointed out that we misspelled CachingWrapperFilter. There are three places in the book, easily located using our handy search engine, where we incorrectly used CachingWrappingFilter instead of the correct CachingWrapperFilter.

Posted on Sun, 18 Sep 2005 06:08

Document and Field boost setting

Pages 38 and 39: boosts should be set as floats: doc.setBoost(1.5f); doc.setBoost(0.1f); subjectField.setBoost(1.2f);
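
The f suffix matters because Java treats a bare literal such as 1.5 as a double, which cannot be passed to a float parameter without a cast. A minimal sketch, using a hypothetical Boostable class in place of Lucene's Document and Field:

```java
public class BoostLiteral {
    // Hypothetical stand-in for a class with a setBoost(float) method.
    public static final class Boostable {
        private float boost = 1.0f;

        public void setBoost(float boost) {
            this.boost = boost;
        }

        public float getBoost() {
            return boost;
        }
    }

    public static void main(String[] args) {
        Boostable doc = new Boostable();
        // doc.setBoost(1.5);  // does not compile: double passed where float is expected
        doc.setBoost(1.5f);    // float literal compiles
        System.out.println(doc.getBoost()); // 1.5
    }
}
```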

Posted on Sat, 13 Aug 2005 15:07

SAXXMLHandler attributeMap initialization

Page 228, in SAXXMLHandler.java, the attributeMap instance variable should be initialized as: private HashMap attributeMap = new HashMap();

Posted on Mon, 13 Jun 2005 11:06

unintended whitespace

Page 85, in the code marked by bullet #3, there is additional whitespace in "searchingBook s". It should be "searchingBooks".

Posted on Fri, 3 Jun 2005 08:43

Caveats that apply

Typo on page 99: "The same performance caveats that apply..."

Posted on Fri, 3 Jun 2005 08:38

Of course

The paragraph that begins with "During indexing..." has a typo - it should read "...even this per-Document analysis is too coarse grained." instead of "course".

Posted on Thu, 2 Jun 2005 08:21

Alluded to...

On page 370, first paragraph, it should read "As alluded to..." rather than "eluded".

Posted on Thu, 2 Jun 2005 08:15

Scoring formula figure omission

The scoring formula shown in figure 3.1 is incomplete. The correct scoring formula (highlighting the omission), directly from Lucene's Similarity class javadoc, is shown here:

This formula was created as two graphics because of its length, and we neglected to incorporate the second part of the graphic into the manuscript layout.

Posted on Mon, 24 Jan 2005 22:04

Indexer command-line example whitespace issue

In section 1.4.1, sub-section Running Indexer, the command-line example appears to pass only a single argument to Indexer. However, there should be a space between build/index and /lucene. The full command line is:

% java lia.meetlucene.Indexer build/index /lucene

Posted on Thu, 6 Jan 2005 11:14