Archive for June, 2006

Geek Speak

| June 18th, 2006

If you eat, drink and speak geek you should have been here at Barcamp Pune on the 17th.

I didn’t have a chance to blog live at Barcamp Pune as I was busy organizing it and also trying to hack together a Firefox extension with Shodhan. It proved to be quite an experience, and I met some innovative individuals. We got a wonderful response to the extension (Recoja) we presented, along with some useful feedback on how we could improve it.

I particularly enjoyed two sessions:

  1. Ontology Based Search by Mukul: He explained how concepts can be extracted from documents, which helps search engines serve up relevant results.
  2. Naive Bayesian Text Classification by Atul: A wonderful session on probabilistic text classification. Unfortunately he could not go into much mathematical detail due to lack of time.

If you couldn’t make it this time, be sure to register for the next one.

Barcamp Pune – Coding Live

| June 17th, 2006

Well, I’m sitting here at Barcamp Pune listening to Atul Chitnis. I’m scheduled to present the Firefox extension called Recoja later in the day, but we haven’t got it working yet. Shodhan and I plan to code it right here at Barcamp.

Using Lucene As A Database

| June 6th, 2006

Many a time we index fields in a document that serve only to classify or distinguish documents and do not contribute to relevance. An analogy would be documents in a library. Here ‘category’ could be the field that identifies the domain to which the document belongs. A typical text search query would therefore look like:

content:neural network AND (category:biology OR category:AI)

Everything seems to work fine. Well, not quite. In the above query we are trying to retrieve documents containing the words ‘neural network’. But if you look closely (try getting an explanation of the score in Lucene), although the category sub-query is meant only to restrict the results to particular domains, it contributes to the relevance as well.

So you must be wondering, “How do I get documents from Biology or AI ranked by their relevance to ‘neural network’?”. Here is how. You don’t need to hack around the Lucene source code. All you have to do is give a nullifying boost (that’s a cool oxymoron (-;) to the respective sub-query. By nullifying boost I mean a boost value so small that it in effect nullifies the score of the sub-query (something like 0.00001). The revised query would therefore look like:

content:neural network AND (category:biology OR category:AI)^0.00001

Thus, although a document must match the category sub-query in order to be part of the result set, the sub-query does not contribute to the document’s score. I like to call such queries non-relevant booleans: non-relevant because they do not contribute to relevance, and boolean because of the condition (AND or OR) under which they must match.

This lets us harness the querying capabilities of a database from within a search engine.

[UPDATE] A nullifying boost of zero would be the ideal case, where you don’t want the sub-query to contribute to the score at all. A small non-zero value instead gives you more control over how much the sub-query contributes to the score.
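
To see the effect of a nullifying boost in isolation, here is a minimal sketch in Python. It is a toy model and not Lucene's actual scoring formula: the documents, the per-category weights and the function names are all made up for illustration. The only assumption is that a document's score is the sum of its clause scores, with the category clause multiplied by its boost.

```python
# Toy model only: Lucene's real scoring is more involved than this.
# All documents, weights and names below are hypothetical.

# Pretend score of each category term (e.g. a rarer term scores higher).
CATEGORY_WEIGHT = {"biology": 0.5, "ai": 3.0}

DOCS = [
    {"id": "doc1", "category": "biology",
     "content": "neural network basics for neural network builders"},
    {"id": "doc2", "category": "AI",
     "content": "notes on a neural network"},
    {"id": "doc3", "category": "physics",
     "content": "neural network"},
    {"id": "doc4", "category": "biology",
     "content": "gardening tips"},
]


def content_score(doc):
    """Count occurrences of 'neural' and 'network' in the content field."""
    words = doc["content"].lower().split()
    return words.count("neural") + words.count("network")


def category_score(doc):
    """Score of the (category:biology OR category:AI) sub-query; 0 if no match."""
    return CATEGORY_WEIGHT.get(doc["category"].lower(), 0.0)


def score(doc, category_boost=1.0):
    """Both clauses are required (AND); the boost scales only the category part."""
    content = content_score(doc)
    category = category_score(doc)
    if content == 0 or category == 0:
        return None  # fails a required clause, so not in the result set
    return content + category_boost * category


def rank(docs, category_boost=1.0):
    """Return ids of matching documents, best score first."""
    hits = [(score(d, category_boost), d["id"]) for d in docs]
    hits = [(s, i) for s, i in hits if s is not None]
    return [i for s, i in sorted(hits, key=lambda h: -h[0])]


if __name__ == "__main__":
    print(rank(DOCS))           # category clause influences the order
    print(rank(DOCS, 0.00001))  # category clause only filters
    print(rank(DOCS, 0.0))      # zero boost: no contribution at all
```

In this toy data, doc2 mentions the search words less often than doc1 but sits in the higher-weighted category, so without a boost it ranks first. With a boost of 0.00001 or 0 the category clause still excludes doc3 and doc4, but the order of the remaining documents is decided by the content clause alone.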

CAPTCHA This

| June 5th, 2006

Here’s a cool CAPTCHA I came across on an IBM developerWorks blog:

CAPTCHA

So brush up on your mathematics before you plan to comment (-;