Archive for January, 2006

Range Search In Lucene

January 13th, 2006

You’ll be quite surprised to find out how Lucene actually expands your range queries. As pointed out by Simon, a range query is enumerated into every possible value in the given range. Now ain’t that naive >-:. This limits a range to about 1024 values. Simon also points out a possible solution for dates: index them as strings of the form ‘yyyymmdd’.

I tried doing the same on one of my recent projects, where I was indexing dates as ‘yyyymmdd’ strings. But when I actually had a look at my expanded query via Limo, I found that Lucene enumerates string range queries as well.

Apparently this is neither a bug nor a feature, but a “known behaviour”.
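To get a feel for why ‘yyyymmdd’ strings do not escape the limit, here is a rough sketch in Python. It is not Lucene code: `expand_range` is a hypothetical helper that only mimics the enumeration described above, `MAX_CLAUSES` is taken from the “about 1024 values” figure, and the index of one document per day is made up for illustration.

```python
from datetime import date, timedelta

# Assumed limit, taken from the "about 1024 values" figure in the post
MAX_CLAUSES = 1024


def expand_range(indexed_terms, lower, upper):
    """Return every distinct indexed term in [lower, upper].

    Hypothetical stand-in for the enumeration step: one entry per term.
    """
    return sorted(t for t in set(indexed_terms) if lower <= t <= upper)


# Made-up index: one document per day for roughly four years
start = date(2002, 1, 1)
terms = [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(4 * 365)]

# 'yyyymmdd' strings sort in date order, so string comparison is enough
clauses = expand_range(terms, "20030101", "20051231")
print(len(clauses), "terms; over the limit:", len(clauses) > MAX_CLAUSES)
```

Under these assumptions a three-year range already needs more than 1024 terms, because every distinct day in the index counts as one value.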

Ever thought about how the collection of tags with varying font sizes (known as a Tag Cloud) is populated? As I say, ‘there’s an algorithm for everything’, and there’s an algorithm for this too. Assuming you know all about tag popularity (if not, refer to the previous post), I’ll go ahead and explain it.

The distinctive feature of a tag cloud is its different groups of font sizes. The number of such groups depends entirely on the developer. Usually six such size groups work well.

Assume any suitable metric for measuring popularity (for instance, ‘number of users using the tag’). We can always obtain the maximum and minimum values of that metric. For example:

max(Popularity) = 130
min(Popularity) = 35

Therefore we can define the width of one font-size block as:
( max(Popularity) - min(Popularity) ) / 6

For the above values, one block spans (130 - 35) / 6 = 15.83 ≈ 16.
The font sizes could therefore be bounded as follows:

Range         Font-Size
35 to 51      1
52 to 68      2
69 to 85      3
86 to 102     4
103 to 119    5
120 to 136    6
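The table above can be reproduced with a few lines of Python. This is only a sketch under some assumptions: popularity values are integers, the block width is rounded as in the example, and `font_size` is a name made up for illustration.

```python
def font_size(popularity, min_pop, max_pop, groups=6):
    """Map an integer popularity to a font-size group from 1 to `groups`."""
    # Width of one block, rounded as in the post: (130 - 35) / 6 = 15.83 -> 16
    block = round((max_pop - min_pop) / groups)
    # Each group covers block + 1 integer values, e.g. 35 to 51 inclusive
    group = (popularity - min_pop) // (block + 1) + 1
    # Clamp so out-of-range inputs still get a valid size (an added safeguard)
    return max(1, min(groups, group))


for p in (35, 51, 52, 100, 130):
    print(p, font_size(p, 35, 130))
```

The clamp is not part of the original scheme. It just keeps the result between 1 and 6 if a value falls outside the measured minimum and maximum.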

That’s as easy as it can get.

Calculation Of Tag Popularity

January 2nd, 2006

Determining the popularity of tags has very fluid solutions, which keep changing from application to application. In general, though, one metric that can be used is the number of unique items tagged with a particular tag. A second metric is the number of unique users using that tag. I’ve come up with a formula that encompasses both of these:

Popularity = ( Usage Count / Number of tagged Items ) * ( User Count / Number of Taggers )

where,
Usage Count (UsgCnt) : the number of unique items having the tag.
Number of tagged Items (NTI) : the total number of items having at least one tag (i.e. items participating in tagging)
User Count (UsrCnt) : the number of users using this tag.
Number of Taggers (NOT) : the total number of users participating in tagging.

Case 1:
UsgCnt = 15, NTI = 40, UsrCnt = 2, NOT = 20
Popularity = 0.0375
Note: This represents a case in which the two users may be trying to spam the system by tagging many items with the specific tag.

Case 2:
UsgCnt = 15, NTI = 40, UsrCnt = 9, NOT = 20
Popularity = 0.16875
Note: Here we clearly see that as the number of users using this tag increases, the popularity increases as well (suggesting not spam but folksonomy).

Case 3:
UsgCnt = 15, NTI = 40, UsrCnt = 1, NOT = 1
Popularity = 0.375
Note: Here it can be noted that if there is only one user in the system, the popularity becomes independent of the user ratio and depends entirely on the tagged items ratio.

Case 4:
UsgCnt = 40, NTI = 40, UsrCnt = 10, NOT = 20
Popularity = 0.5
Note: In this case, if all the items in the system are tagged using the specific tag (UsgCnt = NTI), the popularity depends entirely on the number of users using this tag.
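The formula and the four cases above can be checked with a short Python sketch. The function name `tag_popularity` and the zero-division guard are my own additions for illustration. The arithmetic is just the product of the two ratios.

```python
def tag_popularity(usage_count, tagged_items, user_count, taggers):
    """(UsgCnt / NTI) * (UsrCnt / NOT), as defined above."""
    # Guard added for the sketch: no tagged items or no taggers means no popularity
    if tagged_items == 0 or taggers == 0:
        return 0.0
    return (usage_count / tagged_items) * (user_count / taggers)


# (UsgCnt, NTI, UsrCnt, NOT) for each case
cases = {
    "Case 1": (15, 40, 2, 20),
    "Case 2": (15, 40, 9, 20),
    "Case 3": (15, 40, 1, 1),
    "Case 4": (40, 40, 10, 20),
}

for name, args in cases.items():
    print(name, round(tag_popularity(*args), 5))
```

Since both ratios lie between 0 and 1, the result is also between 0 and 1, which makes it easy to compare tags or feed the value into a font-size grouping.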

This gives a fairly rough idea of tag popularity calculation.