Archive for the Algorithm Category

I started this project as a weekend thingy a day back but hit a hurdle due to the Twitter rate limits. Since I couldn’t complete this project, I thought it would be rather nice to write a post explaining what I had in mind.

Twitterers like bloggers are driven/motivated by their *follower* counts (who doesn’t like attention). But unlike blogs there is no mechanism in Twitter to determine the visibility of ones tweets or gain insights into how one can maximize the visibility of their tweets. By visibility I mean the chance of my tweet being read by my followers. I’ll try to explain my approach to estimate this *visibility* with the following example:

tweetistic

The image above depicts the user (imagine yourself) at the center along with his followers and the users they are following (users which your followers are following – this can get crazy). The idea is just this dead simple – the visibility of your tweet depends on the twittering habits of your followers followings. Let me slow down – imagine a queue of size 20 assigned to each of your followers. This is the list of recent tweets displayed on Twitter for every user. Your tweet’s lifetime is as long it doesn’t get pushed out of this queue – which depends solely on the twittering frequency of your followers followings. The faster they tweet, the faster your tweet vanishes off this list and vice-versa. This lifetime/visibility of your tweet averaged over all your followers can be figured out algorithmically as follows:

  • Fetch all your followers and their followings (lets call this FF).
  • For each FF – calculate their average twittering time (do this by analyzing their last 50-100 tweets).
  • Using the times calculated above, calculate the time for your tweet to get pushed off the queue for each of your followers. This can be done by simulating tweets from each of the followers following (FF) at the frequency calculated above and pushing them into the queue.
  • Average this value across all your followers.

The above not only gives you insight on how long would your tweet be visible but also which of your followers could potentially read your tweet. For example, you could answer queries like – ‘if I tweet now, which of my followers could still read this tweet after 5 hours’ or ‘if I tweet now, which of my followers will have already read/missed this tweet after 3 hours’.

For more accurate results, one could also analyze the twittering habits of the FF’s (followers following) and generate a probabilistic model of their twittering frequency relative to the time of the day. This could help generate dynamic statistics depending on the time you intend to twitter.

Another insight that could be provided with the above statistics could be the duration one should wait before putting another tweet. I, for example, twitter sometimes in bursts (with just seconds delay between consecutive tweets). Followers usually give the most attention to your most recent tweet. The more consecutive tweets you have, the higher the chances of the older tweets getting less attention. If one can figure out the average time it would take for ones older tweet to get pushed off the ‘recent 20′ list of his follower/s, it would give one a better idea of how long should one wait before twittering again.

I’m sure that there are many more cool statistics/insights one can figure out from ones network in Twitter. I would be glad to hear out your take on this as well as any improvements you might have to suggest. I would love to make this an app once I can get Twitter to raise the hourly request limit for me – Twitter are you listening?

The Monty Hall Problem

| May 16th, 2008

I assume that most of you have watched the movie ‘21‘. Well if you haven’t, I recommend that you watch it. Around 20 minutes into the movie there is a scene where the professor poses a problem in front of the protagonist:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice? [via Whitaker, Craig F. (1990). [Letter]. “Ask Marilyn” column, Parade Magazine p. 16]

The protagonist chooses to switch. It is not very intuitive as to why one should deviate from one’s original choice. It seems as if the probability of winning the car just bumped up to 1/2 from 1/3 after the host opens a door with a goat behind it. But it can be empirically proven that switching the choice is the rational strategy for the player. This ubiquitous game show dilemma is the Monty Hall problem.

Let me try to prove why is it so.

The first column represents the probable contents of the door originally chosen by the player (assume door 1). The second column represents the contents of the door opened by the host (assume door 3). The last column shows the contents of the other door, which the player chooses if he makes a switch. It can be clearly seen in the table above that probability of winning the car changes from 1/3 (as in the first column) to 2/3 (as in the last column). This probability can be observed if do some kind of random sampling.
The above is valid only for scenarios like the problem above where a player is given some choices and some choices are revealed based on which he has to make a decision.

There was a speculation about Google planning to introduce an ‘unavailable after’ meta tag. It would probably look something like this:

< META name='unavailable_after' content='Wed, 01 Aug 2007 00:00:01 GMT'>

By specifying this tag, webmasters can tell Google not to index a particular page after the specified date OR consider the given page as stale. This would be appropriate for promotional pages where the promotions expire after a given time. This would help unclog the search engine indexes of irrelevant data.

This valuable piece of information, provided by the ‘unavailable_after’ tag, would not only be used to clear up the Google’s index but could also make its way into Google’s ranking algorithms. There are two perspectives to how a search engine could use this data for ranking:

  • When a page specifies its expiry/unavailability date, it implicitly tells the search engine of the period for which it would be most relevant. Hence, as the unavailability date of a page approaches it should start becoming less irrelevant to a users query. For example: The user query for ‘fedora release notes‘, currently, has the top 3 results pointing to the notes of FC7, whereas the other results have a random mix of release notes for FC3, FC2, FC4 and FC5 (with FC3 and FC2 pages being ranked higher than FC5 and FC4 respectively). Lets say that FC8 was going to be released this November [schedule]. Assume that the Release Notes page for FC7 has the unavailable_after tag set for sometime around December. Thus, as December approaches, FC7 pages would start losing their relevancy for such queries and gradually transition lower in the search rankings, making FC8 the most relevant result. This would resolve the current inaccurate ordering of results obtained for the query ‘fedora release notes‘ on Google. This could be achieved in a manner similar to proposed in the following paper: Time Damping Of Textual Relevance.
  • This perspective (inverse to the above perspective) would be very specific to promotional/shopping related pages. Most shopping promotions/offers are valid for a given period. Hence, such pages should become more relevant to a users query as they approach their expiry date. For example: Consider a user query for ‘20% discount shoes‘. Lets assume that the results have pages from Zappos as well as Shoebuy both offering a 20% discount on shoes. The Shoebuy sale is going to last for about two more weeks from today (as specified by the unavailable_after tag) whereas the Zappos sale would be ending in another two days. Since both the stores are offering the same percentage discount, it would be more appropriate to rank the Zappos page higher as its offer would be ending soon. From the point of a user (shopper), he would be more interested in looking at offers ending soon, as he can always checkout the other long lasting offers at some later time.

The perspectives above represent only a few of the conceivable usages of the unavailable_after tag. There could be a numerous other perspectives to how this data could be utilized to improve search rankings.

I would lover to hear your take on the unavailable tag, particularly if you can provide another perspective to utilizing this data.

Search has always been an integral part of any tagging system. Such systems need to make sense out of the abundant user generated metadata such that the documents/items can be ranked in some order. However, very little has been said or written openly about such ranking algorithms for tagging systems.

Conventional Methods

Most systems, that allow tag search, base their rankings on factors like simply the ‘number of unique users’ or on ratios like ‘number of unique users for tag t / number of unique users for all tags’ etc. These conventional algorithms do work, but not quite so well for large datasets where they can be exploited. They also often do not represent the true relevance. Reminds me often of the pre-PageRank era of information retrieval systems.

So, which relevance algorithm do I use?

Well, you can always use the conventional methods, but then you can always try the algorithm I devised. This algorithm seems to capture the true essence of relevance in tagging systems. I call it the WisdomRank as it is truly based on the ‘wisdom’ of the crowds, the fundamental part of any social system. Read along to understand it in detail (or download the pdf).


Inferring relevance for tag search

from user authority – Abstract

Tagging is an act of imparting human knowledge/wisdom to objects. Thus a tag, a one word interpretation/categorization of the object by the user, fundamentally represents the basic unit of human wisdom for any object. This wisdom is difficult to quantify as it is relative for every user. One approach to quantify this would be to use the wisdom of the other users to define this for us. This can be done by assuming that every tag corresponds to a topic for which every user has some authority. Also, every tag added to an object corresponds to a vote, similar to the Digg model, asserting that the object belongs to that topic (tag).

Let us consider a user Ui who has tagged object Oj with the tag Tk. Whenever other users in the system tag Oj with Tk, they are implicitly affirming Ui’s wisdom for tag Tk.

Thus, we define the function affirmation for the tuple(u, d, t) as the number of other users who have also tagged document d with tag t:

affirmation(u, d, t) = ∑i=All users except ‘u’ tagged(ui, d, t)

where,

u – the user
d – the document/object
t – the tag
tagged – 1 if the user Ui has tagged d with t
- 0 otherwise

Hence, we can proceed to define the wisdom of the user for a topic (tag) t as the sum of all such assertions by other users,

wisdom(u, t) = ∑x=For all documents d tagged with tag t by U affirmation(u, d, t)

Likewise, we can now define the authority of a user for the topic t, as the ratio of the user’s wisdom to the collective wisdom for t. Hence,

authority(u, t) = wisdom(u, t) / ∑ wisdom(ui, t)

For example: Let us determine the authority of user u1 for tag t1

Object d1: Object d2: Object d3:
t1 by u1
t1 by u1 t1 by u2
t1 by u2
t3 by u1 t1 by u3
t1 by u3
t3 by u1
t2 by u1

affirmation(u1, d1, t1) = 2 affirmation(u1, d2, t1) = 0
Hence, wisdom(u1, t1) = 2

Likewise for other users,

wisdom(u2, t1) = 3
wisdom(u3, t1) = 3

Hence the authority of user u1 for t1 is as follows:

authority(u1, t1) = 2 / (2 + 3 + 3) = 2 / 8 = 0.25

Whenever a user tags an object with a tag, he does so with the authority he possesses for that tag. Thus as compared to conventional methods, where the objects are usually ranked on the number of instances of the tags, in this method the measure of the relevance of a tag for an object is equivalent to the sum of all such user authorities. Thus,

relevance_metric(d, t) = ∑i= all user who have tagged document d with t authority(u, t)

This relevance score, when calculated for every tag would provide an accurate measure for ranking the objects. As compared to the conventional methods where more number of instances of a tag for an object ensured a higher relevance for that tag, here the number of authoritative users counts.

Let us consider the following example:

Object d1: Object d2:
t1 by u1
t1 by u2
t2 by u5
t1 by u3
t1 by u4

Let us assume that u1 has a very high authority for tag t1. Hence in the above scenario, a search for tag t1 may rank d1 higher than d2, if

authority(u1, t1) > authority(u2, t1) + authority(u3, t1) + authority(u4, t1)

This result is with the assumption that u1’s authority is greater than those of u2,u3 and u4 combined.

On the other hand, d2 would be ranked higher than d1 if the combined authorities of u2, u3 and u4 exceed that of u1. If the majority of the users are suggesting something, it indicates that their suggestion is far more valuable than that of an individual user or a subset of users.

Future Enhancements

While calculating the user assertions this algorithm currently considers all such users as equal even though they may have varying authorities for the corresponding tag. As a future enhancement, I plan to incorporate the authorities of the users as well into the affirmation calculations.