Archive for the Lucene Category

Using Lucene As A Database

June 6th, 2006

Many a time we index fields in a document that serve only to classify or distinguish documents and do not bear on their relevance. An analogy would be documents in a library. Here ‘category’ could be the field that identifies the domain to which a document belongs. A typical text search query would then go as:

content:"neural network" AND (category:biology OR category:AI)

Everything seems to work fine. Well, not yet. In the above query we are trying to retrieve documents containing the words ‘neural network’. But if you look closely (try getting an explanation of the score in Lucene), the category sub-query, which seems to be there only to restrict the results to particular domains, contributes to the relevance score as well.

So you must be wondering “How do I get documents from Biology or AI, ranked only by their relevance to ‘neural network’?”. Here is how. You don’t need to hack around the Lucene source code. All you have to do is give a nullifying boost (that’s a cool oxymoron (-;) to the respective sub-query. By nullifying boost I mean a boost value so small that it in effect nullifies the score of the sub-query (something like 0.00001). Therefore the revised query would look like:

content:"neural network" AND (category:biology OR category:AI)^0.00001

Thus, although the category sub-query must match for a document to be part of the result set, it effectively does not contribute to the score of the document. I like to term such queries non-relevant booleans: non-relevant because they do not contribute to relevance, and boolean because of the condition (AND or OR) under which they must match.
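To see why the boost changes the ranking, here is a minimal sketch in Python. It is a toy model, not Lucene's actual scoring (which also involves tf-idf weights, coord and query normalisation): it simply assumes a document's score is the sum of each clause's score times that clause's boost. The per-clause scores and document names are made up for illustration.

```python
# Toy model only: a document's score is taken to be the sum of each
# matching clause's score multiplied by that clause's boost.
# Lucene's real formula is more involved; this just shows the effect.

def score(doc, category_boost=1.0):
    """Return a combined score, or None if a required clause does not match."""
    if doc["content_score"] <= 0 or doc["category_score"] <= 0:
        return None  # the clauses are joined with AND, so both must match
    return doc["content_score"] + category_boost * doc["category_score"]

def rank(docs, category_boost=1.0):
    """Return the names of matching documents, best score first."""
    scored = [(score(d, category_boost), d["name"]) for d in docs]
    matching = [(s, n) for s, n in scored if s is not None]
    return [n for s, n in sorted(matching, key=lambda p: p[0], reverse=True)]

# Hypothetical per-clause scores, invented for this example.
docs = [
    {"name": "weak_match", "content_score": 0.2, "category_score": 0.9},
    {"name": "strong_match", "content_score": 0.8, "category_score": 0.1},
    {"name": "wrong_category", "content_score": 0.9, "category_score": 0.0},
]

print(rank(docs))                          # category score skews the order
print(rank(docs, category_boost=0.00001))  # order follows content relevance
```

With the default boost the weaker content match wins because of its category score; with the nullifying boost the category clause still filters out the wrong-category document but no longer decides the order.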

This lets us harness the querying capabilities of a database from within a search engine.

[UPDATE] A nullifying boost of zero would be the ideal case, where you don’t want the subquery to contribute to the score at all. A small non-zero value instead gives you finer control over how much the subquery contributes to the score.

Range Search In Lucene

January 13th, 2006

You’ll be quite surprised to find out how Lucene actually expands your range queries. As pointed out by Simon, a range query is enumerated into one clause for every indexed term that falls within the given range. Now ain’t that naive >-:. This limits the range to about 1024 values, which is the default maximum number of clauses in a BooleanQuery. Simon also points out a possible solution for dates: indexing them as strings of the form ‘yyyymmdd’.
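The expansion can be pictured with a small Python sketch. This is a stand-in for the behaviour described above, not Lucene code: `expand_range`, the `TooManyClauses` class defined here and the sample terms are all hypothetical, and the 1024 limit is the default clause count mentioned in the text.

```python
import bisect

MAX_CLAUSES = 1024  # the default clause limit referred to above

class TooManyClauses(Exception):
    """Stand-in for the error raised when the expansion grows too large."""

def expand_range(sorted_terms, lower, upper, max_clauses=MAX_CLAUSES):
    """Return every indexed term in [lower, upper], one per would-be clause."""
    start = bisect.bisect_left(sorted_terms, lower)
    end = bisect.bisect_right(sorted_terms, upper)
    matches = sorted_terms[start:end]
    if len(matches) > max_clauses:
        raise TooManyClauses(f"{len(matches)} terms exceed limit of {max_clauses}")
    return matches

# Hypothetical index holding 5000 distinct zero-padded numeric terms.
terms = [f"{n:05d}" for n in range(5000)]

small = expand_range(terms, "00010", "00019")
print(len(small))  # a narrow range expands to a handful of clauses

try:
    expand_range(terms, "00000", "04999")  # a wide range hits the limit
except TooManyClauses as exc:
    print("rejected:", exc)
```

The point of the sketch is that the cost depends on how many distinct terms sit inside the range, not on how the range endpoints are written.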

I tried doing the same on one of my recent projects, where I was indexing dates as strings of the form ‘yyyymmdd’. But when I actually had a look at my expanded query via Limo, I found Lucene enumerating terms for string range queries as well.
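A quick Python sketch shows why the string format alone does not help. It assumes, purely for illustration, an index with one document for every day between two made-up dates; `day_terms` is a hypothetical helper, not part of Lucene.

```python
from datetime import date, timedelta

def day_terms(first, last):
    """Return one 'yyyymmdd' string per day from first to last, inclusive."""
    days = (last - first).days + 1
    return [(first + timedelta(days=i)).strftime("%Y%m%d") for i in range(days)]

# Assumed index contents: one term per day over roughly three years.
terms = day_terms(date(2003, 1, 1), date(2006, 1, 13))

# String order matches date order, so a string range picks the right days.
print(terms == sorted(terms))

# But every distinct day inside the range is still a separate term.
in_range = [t for t in terms if "20030101" <= t <= "20060113"]
print(len(in_range), len(in_range) > 1024)
```

So the ‘yyyymmdd’ form keeps range comparisons correct, yet a range spanning about three years of daily terms already enumerates more than 1024 of them.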

Apparently this is neither a bug nor a feature but a “known behaviour”.