Using Lucene As A Database

June 6th, 2006

Many a time we index fields in a document that serve only to classify or distinguish documents, not to determine their relevance. An analogy is the documents in a library: here ‘category’ could be the field that identifies the domain a document belongs to. A typical text search query would therefore look like:

content:(neural network) AND (category:biology OR category:AI)

Everything seems to work fine. Well, not quite. In the above query we are trying to retrieve documents containing the words ‘neural network’. But if you look closely (try getting an explanation of the score in Lucene), you will see that although the category sub-query is meant only to restrict the results to particular domains, it contributes to the relevance score as well.

So you must be wondering, “How do I get documents from Biology or AI, ranked by their relevance to ‘neural network’?” Here is how. You don’t need to hack the Lucene source code. All you have to do is give a nullifying boost (that’s a cool oxymoron (-;) to the respective sub-query. By nullifying boost I mean a boost value so small (something like 0.00001) that it effectively nullifies the sub-query’s score. The revised query would therefore look like:

content:(neural network) AND (category:biology OR category:AI)^0.00001

Thus, although a document must match the category sub-query in order to be part of the result set, the sub-query contributes practically nothing to the document’s score. I like to call such queries non-relevant booleans: non-relevant because they do not contribute to relevance, and boolean because of the condition (AND or OR) under which they must match.
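To see why the tiny boost changes the ranking, here is a toy sketch in Python. It is not Lucene’s scoring formula: it simply assumes a document’s score is the sum of its sub-query scores, each multiplied by its boost. The per-document numbers and the names `docs` and `ranked` are made up for illustration.

```python
# Toy additive model (an assumption, not Lucene's real formula):
#   score = content_score + category_boost * category_score
# A category score of 0.0 means the document does not match the category clause.
docs = {
    "doc_a": {"content": 0.9, "category": 0.2},
    "doc_b": {"content": 0.6, "category": 0.8},
    "doc_c": {"content": 1.0, "category": 0.0},  # wrong category, must be excluded
}

def ranked(docs, category_boost):
    """Return ids of documents matching the category clause, best score first."""
    def score(doc_id):
        s = docs[doc_id]
        return s["content"] + category_boost * s["category"]

    # The category clause stays a required match regardless of its boost.
    matching = [d for d in docs if docs[d]["category"] > 0]
    return sorted(matching, key=score, reverse=True)

default_order = ranked(docs, 1.0)        # category score influences the ranking
nullified_order = ranked(docs, 0.00001)  # ranking follows the content score
```

With the default boost of 1, doc_b outranks doc_a purely because of its category score. With the nullifying boost, doc_a comes first because only the content score matters in practice, and doc_c stays out of the results either way.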

This lets us harness the querying capabilities of a database from within a search engine.

[UPDATE] A nullifying boost of zero is the ideal case when you don’t want the sub-query to contribute to the score at all. A small non-zero value instead gives you control over how much the sub-query contributes.
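If you assemble such queries in code, a small helper can make the boost a tunable knob. The Python sketch below only builds the query string; `with_boost` is a hypothetical helper, not part of Lucene. It writes the boost as a plain decimal on the assumption that the query parser expects a number like 0.00001 rather than scientific notation such as 1e-05.

```python
def with_boost(subquery: str, boost: float) -> str:
    """Wrap a sub-query in parentheses and append a ^boost suffix.

    The boost is written as a plain decimal with up to 10 fractional digits;
    anything smaller than that is written as 0.
    """
    text = f"{boost:.10f}".rstrip("0").rstrip(".")
    return f"({subquery})^{text}"

# Parentheses after the field name apply it to both terms.
relevance = "content:(neural network)"
category = "category:biology OR category:AI"

tiny_boost_query = f"{relevance} AND {with_boost(category, 0.00001)}"
zero_boost_query = f"{relevance} AND {with_boost(category, 0)}"
```

Passing 0 removes the category sub-query’s contribution entirely, while a small non-zero value leaves room to tune it.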

  • Daniel

    Why don’t you just use bla^0 instead of that small number?

  • Anand Kishore

    Daniel,

    Zero would be the ideal case when you want to completely eliminate the subquery’s contribution to the score. Thanks for pointing it out, as I forgot to mention it. A relatively small non-zero boost gives you control over how much a given subquery contributes to the score, since zero might not always be the ideal choice.