Archive for January, 2006

Range Search In Lucene

Friday, January 13th, 2006

You’ll be quite surprised to find out how Lucene actually expands your range queries. As pointed out by Simon, a range query is rewritten into one clause for every indexed term that falls within the given range. Now ain’t that naive >-:. Since a boolean query allows 1024 clauses by default, this limits a range to about 1024 distinct values. Simon also points out a possible solution for dates: indexing them as strings of the form ‘yyyymmdd’.
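To make the expansion concrete, here is a rough simulation in plain Python. It does not use Lucene at all: `expand_range` and `MAX_CLAUSES` are made-up names for illustration, and the cap of 1024 is assumed to mirror the default clause limit.

```python
from bisect import bisect_left, bisect_right

# Assumed to mirror the default clause cap; hypothetical constant name.
MAX_CLAUSES = 1024


def expand_range(indexed_terms, lower, upper, max_clauses=MAX_CLAUSES):
    """Return every indexed term in [lower, upper], one 'clause' per term."""
    terms = sorted(set(indexed_terms))
    start = bisect_left(terms, lower)
    end = bisect_right(terms, upper)
    matches = terms[start:end]
    if len(matches) > max_clauses:
        raise ValueError(f"too many clauses: {len(matches)} > {max_clauses}")
    return matches


# Zero-padded numbers indexed as strings, so string order equals numeric order.
index = [f"{n:05d}" for n in range(5000)]

small = expand_range(index, "00010", "00014")
print(small)  # ['00010', '00011', '00012', '00013', '00014']
```

A narrow range expands to a handful of terms, while something like `expand_range(index, "00000", "04999")` would hit the cap and raise in this sketch.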

I tried doing the same on one of my recent projects, where I was indexing dates as ‘yyyymmdd’ strings. But when I actually had a look at my expanded query via Limo, I found Lucene enumerating the terms for string range queries as well.
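A quick way to see why the string form does not help by itself: the ‘yyyymmdd’ strings do sort in date order, but every day is still a separate term. The sketch below assumes one document was indexed for every single day; `day_terms` is a hypothetical helper, not part of any library.

```python
from datetime import date, timedelta


def day_terms(first, last):
    """Return one 'yyyymmdd' string per calendar day from first to last inclusive."""
    days = (last - first).days + 1
    return [(first + timedelta(days=i)).strftime("%Y%m%d") for i in range(days)]


terms = day_terms(date(2003, 1, 1), date(2006, 1, 13))

# Lexicographic order of the strings matches chronological order.
is_chronological = terms == sorted(terms)
print(is_chronological)  # True

# Every day in the range is still a distinct term to enumerate.
print(len(terms))  # 1109
```

So under that assumption, a range of just over three years already goes past 1024 terms.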

Apparently this is neither a bug nor even a feature, but a “known behaviour”.

Tag Cloud Font Distribution Algorithm

Friday, January 6th, 2006

Ever wondered how that collection of tags with varying font sizes (known as a Tag Cloud) gets populated? As I say, ‘there’s an algorithm for everything’, and there’s an algorithm for this too. Assuming you know all about tag popularity (if not, refer to the previous post), I’ll go ahead and explain it.

The distinctive feature of a tag cloud is its different groups of font sizes. The number of such groups depends entirely upon the developer. Usually, having six such size groups works well.

Assume any suitable metric for measuring popularity (for instance ‘number of users using the tag’). We can always obtain the max and min numbers for the same. For example:

max(Popularity) = 130
min(Popularity) = 35

Therefore we can define one block of font-size as:
( max(Popularity) - min(Popularity) ) / 6

For the above values we get one such block range as (130 - 35) / 6 = 15.83 ≈ 16.
Font-sizes could therefore be bound as follows, with each range starting one above the end of the previous one:

Range         Font-Size
35 to 51      1
52 to 68      2
69 to 85      3
86 to 102     4
103 to 119    5
120 to 136    6

That’s as easy as it can get.
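For the curious, the whole thing fits in a few lines of Python. This is only a sketch: `font_size` and the sample tags are made up for illustration, and the range width of block + 1 is chosen so the results match the table above (35 to 51, 52 to 68, and so on).

```python
def font_size(popularity, min_pop, max_pop, groups=6):
    """Map a popularity value to a font-size group from 1 to `groups`."""
    # One block of font-size, rounded: (130 - 35) / 6 = 15.83 -> 16.
    block = round((max_pop - min_pop) / groups)
    # Each range covers block + 1 values (35..51, 52..68, ...).
    width = block + 1
    size = (popularity - min_pop) // width + 1
    # Clamp so out-of-range values still land in a valid group.
    return max(1, min(groups, size))


# Hypothetical tags with their popularity counts.
tags = {"lucene": 130, "java": 86, "search": 52, "blog": 35}
low, high = min(tags.values()), max(tags.values())

sizes = {tag: font_size(count, low, high) for tag, count in tags.items()}
print(sizes)  # {'lucene': 6, 'java': 4, 'search': 2, 'blog': 1}
```

The clamp at the end is a design choice of this sketch, so that a tag with a count outside the known min and max still gets a usable size.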