Archive for the Search Category

Performance Juice

July 2nd, 2008

In the company of *exceptional* people at the Exceptional Performance Group, I too have caught the performance bug. I asked myself today – “Wouldn’t it be great to have all the performance blogs consolidated in one place?” There are a number of performance experts (both frontend and backend) outside of Yahoo!. I have a couple of such blogs on my feedreader and I’m sure you too have some to share.

I’m trying to consolidate all such performance-centric blogs at Performance Juice. While I’m busy scouring the web for great blogs I may have missed, I would love to include any links you might have in Performance Juice. You can contribute by listing blogs in the spreadsheet below. You can edit the spreadsheet here. Let’s make Performance Juice a community-driven vertical search for anything performance.

Be sure to quench your performance thirst at Performance Juice or just subscribe to this feed and stay up to date on the performance karma.

There are many performance-related posts we come across every day that are not necessarily from performance-centric blogs. If you come across any such great posts, just tag them as “performancejuice” on del.icio.us and I’ll make sure they are included.

Any volunteers to manage the search engine? Head here.

[UPDATE] The custom search is now also powered by performance bookmarks on del.icio.us.


There was speculation about Google planning to introduce an ‘unavailable_after’ meta tag. It would probably look something like this:

<META name='unavailable_after' content='Wed, 01 Aug 2007 00:00:01 GMT'>

By specifying this tag, webmasters could tell Google not to index a particular page after the specified date, or to consider the page stale. This would be appropriate for promotional pages where the promotions expire after a given time, and it would help unclog search engine indexes of irrelevant data.
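If such a tag existed, honoring it at index time would amount to a simple date comparison. Here is a hypothetical sketch in Python; the tag name and date format simply follow the speculated example above, and nothing here is Google's actual implementation:

    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    def is_unavailable(meta_content, now=None):
        """True once the speculated 'unavailable_after' date has passed."""
        expiry = parsedate_to_datetime(meta_content)  # e.g. 'Wed, 01 Aug 2007 00:00:01 GMT'
        now = now or datetime.now(timezone.utc)
        return now >= expiry

    # An indexer could skip, or mark as stale, any page for which this returns True.
    print(is_unavailable("Wed, 01 Aug 2007 00:00:01 GMT"))  # True (that date has passed)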

This valuable piece of information, provided by the ‘unavailable_after’ tag, would not only be used to clean up Google’s index but could also make its way into Google’s ranking algorithms. There are two perspectives on how a search engine could use this data for ranking:

  • When a page specifies its expiry/unavailability date, it implicitly tells the search engine the period for which it is most relevant. Hence, as the unavailability date of a page approaches, it should become less relevant to a user’s query. For example: the query ‘fedora release notes’ currently has the top 3 results pointing to the notes for FC7, whereas the other results are a random mix of release notes for FC3, FC2, FC4 and FC5 (with the FC3 and FC2 pages ranked higher than the FC5 and FC4 pages respectively). Let’s say that FC8 was going to be released this November [schedule], and assume that the Release Notes page for FC7 had its unavailable_after tag set to sometime around December. Then, as December approached, the FC7 page would gradually lose relevance for such queries and slide lower in the search rankings, making FC8 the most relevant result. This would resolve the currently inaccurate ordering of results for the query ‘fedora release notes’ on Google, and could be achieved in a manner similar to that proposed in the following paper: Time Damping Of Textual Relevance.
  • This perspective (the inverse of the one above) would be specific to promotional/shopping-related pages. Most shopping promotions/offers are valid for a given period, so such pages should become more relevant to a user’s query as they approach their expiry date. For example: consider a user query for ‘20% discount shoes’, and let’s assume the results include pages from Zappos as well as Shoebuy, both offering a 20% discount on shoes. The Shoebuy sale is going to last for about two more weeks from today (as specified by the unavailable_after tag), whereas the Zappos sale ends in another two days. Since both stores are offering the same percentage discount, it would be more appropriate to rank the Zappos page higher, as its offer ends soon. From a shopper’s point of view, offers ending soon are more interesting, since the longer-lasting offers can always be checked out later. (A rough sketch of both perspectives follows below.)

The perspectives above represent only a few of the conceivable uses of the unavailable_after tag; there could be numerous other ways this data could be utilized to improve search rankings.
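As a rough illustration of the two perspectives above, a ranking function could multiply a page’s ordinary text-relevance score by a factor derived from how close its unavailable_after date is. The shape of the factor below is purely my own guess, not the formula from the time-damping paper or anything Google has published:

    from datetime import datetime, timezone

    def time_factor(unavailable_after, boost_near_expiry=False,
                    horizon_days=90.0, now=None):
        """Scale factor in [0, 1] based on how close the expiry date is.

        boost_near_expiry=False: damp pages as their expiry approaches (perspective 1).
        boost_near_expiry=True:  favour offers that are about to end (perspective 2).
        """
        now = now or datetime.now(timezone.utc)
        days_left = (unavailable_after - now).total_seconds() / 86400.0
        if days_left <= 0:
            return 0.0  # the page is already past its unavailability date
        closeness = min(days_left, horizon_days) / horizon_days  # ~0 = expiring now, 1 = far away
        return closeness if not boost_near_expiry else 1.0 - 0.5 * closeness

    # final_score = base_text_relevance * time_factor(expiry_date, boost_near_expiry=...)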

I would love to hear your take on the unavailable_after tag, particularly if you can provide another perspective on utilizing this data.

Search has always been an integral part of any tagging system. Such systems need to make sense of the abundant user-generated metadata so that documents/items can be ranked in some order. However, very little has been said or written openly about ranking algorithms for tagging systems.

Conventional Methods

Most systems that allow tag search base their rankings on simple factors like the ‘number of unique users’ or on ratios like ‘number of unique users for tag t / number of unique users for all tags’. These conventional algorithms do work, but not so well for large datasets, where they can be exploited, and they often do not represent the true relevance. It reminds me of the pre-PageRank era of information retrieval systems.

So, which relevance algorithm do I use?

Well, you can always use the conventional methods, but you can also try the algorithm I devised, which seems to capture the true essence of relevance in tagging systems. I call it WisdomRank, as it is based on the ‘wisdom of the crowds’, the fundamental ingredient of any social system. Read on to understand it in detail (or download the pdf).


Inferring relevance for tag search from user authority – Abstract

Tagging is an act of imparting human knowledge/wisdom to objects. Thus a tag, a one-word interpretation/categorization of an object by a user, fundamentally represents the basic unit of human wisdom about that object. This wisdom is difficult to quantify as it is relative to every user. One approach is to use the wisdom of the other users to quantify it for us. This can be done by assuming that every tag corresponds to a topic for which every user has some authority, and that every tag added to an object corresponds to a vote, similar to the Digg model, asserting that the object belongs to that topic (tag).

Let us consider a user Ui who has tagged object Oj with the tag Tk. Whenever other users in the system tag Oj with Tk, they are implicitly affirming Ui’s wisdom for tag Tk.

Thus, we define the function affirmation for the tuple (u, d, t) as the number of other users who have also tagged document d with tag t:

affirmation(u, d, t) = ∑ tagged(ui, d, t), summed over all users ui other than u

where,

u – the user
d – the document/object
t – the tag
tagged(ui, d, t) – 1 if user ui has tagged d with t, 0 otherwise

Hence, we can proceed to define the wisdom of the user for a topic (tag) t as the sum of all such affirmations by other users:

wisdom(u, t) = ∑ affirmation(u, d, t), summed over all documents d tagged with t by u

Likewise, we can now define the authority of a user for the topic t, as the ratio of the user’s wisdom to the collective wisdom for t. Hence,

authority(u, t) = wisdom(u, t) / ∑ wisdom(ui, t), where the sum runs over all users ui
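To make these definitions concrete, here is a minimal Python sketch of affirmation, wisdom and authority. The data layout (a set of (user, document, tag) triples) and all the names are my own illustration, not part of the paper itself:

    def tagged(tags, user, doc, tag):
        """1 if `user` tagged `doc` with `tag`, 0 otherwise."""
        return 1 if (user, doc, tag) in tags else 0

    def affirmation(tags, users, user, doc, tag):
        """Number of users other than `user` who also tagged `doc` with `tag`."""
        return sum(tagged(tags, u, doc, tag) for u in users if u != user)

    def wisdom(tags, users, user, tag):
        """Sum of affirmations over every document the user tagged with `tag`."""
        docs = {d for (u, d, t) in tags if u == user and t == tag}
        return sum(affirmation(tags, users, user, d, tag) for d in docs)

    def authority(tags, users, user, tag):
        """The user's wisdom for `tag` as a fraction of the collective wisdom."""
        total = sum(wisdom(tags, users, u, tag) for u in users)
        return wisdom(tags, users, user, tag) / total if total else 0.0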

For example: Let us determine the authority of user u1 for tag t1

Object d1: t1 by u1, t1 by u2, t1 by u3, t3 by u1
Object d2: t1 by u1, t3 by u1, t2 by u1
Object d3: t1 by u2, t1 by u3
affirmation(u1, d1, t1) = 2
affirmation(u1, d2, t1) = 0
Hence, wisdom(u1, t1) = 2

Likewise for other users,

wisdom(u2, t1) = 3
wisdom(u3, t1) = 3

Hence the authority of user u1 for t1 is as follows:

authority(u1, t1) = 2 / (2 + 3 + 3) = 2 / 8 = 0.25
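Plugging the example data into the sketch above reproduces the hand-worked result (the exact placement of the t2/t3 tags follows my reading of the table above and does not affect the numbers for t1):

    # Example data from the table above, as (user, document, tag) triples.
    tags = {
        ("u1", "d1", "t1"), ("u2", "d1", "t1"), ("u3", "d1", "t1"), ("u1", "d1", "t3"),
        ("u1", "d2", "t1"), ("u1", "d2", "t3"), ("u1", "d2", "t2"),
        ("u2", "d3", "t1"), ("u3", "d3", "t1"),
    }
    users = {"u1", "u2", "u3"}

    print(authority(tags, users, "u1", "t1"))  # 2 / (2 + 3 + 3) = 0.25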

Whenever a user tags an object, he does so with the authority he possesses for that tag. Thus, as compared to conventional methods, where objects are usually ranked by the number of instances of a tag, in this method the relevance of a tag for an object is the sum of the authorities of all the users who applied it:

relevance_metric(d, t) = ∑ authority(ui, t), summed over all users ui who have tagged document d with t

This relevance score, when calculated for every tag, provides a measure for ranking the objects. Compared to conventional methods, where more instances of a tag for an object meant higher relevance for that tag, here it is the number of authoritative users that counts.
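Continuing the same sketch, the relevance metric simply sums the authorities of the users who applied the tag, and documents can then be ordered by it (function and variable names remain my own illustration):

    def relevance_metric(tags, users, doc, tag):
        """Sum of the authorities of every user who tagged `doc` with `tag`."""
        taggers = {u for (u, d, t) in tags if d == doc and t == tag}
        return sum(authority(tags, users, u, tag) for u in taggers)

    def rank_documents(tags, users, docs, tag):
        """Documents ordered by descending relevance for `tag`."""
        return sorted(docs, key=lambda d: relevance_metric(tags, users, d, tag),
                      reverse=True)

    print(rank_documents(tags, users, ["d1", "d2", "d3"], "t1"))
    # ['d1', 'd3', 'd2'] for the example data above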

Let us consider the following example:

Object d1: t1 by u1, t2 by u5
Object d2: t1 by u2, t1 by u3, t1 by u4

Let us assume that u1 has a very high authority for tag t1. Hence in the above scenario, a search for tag t1 may rank d1 higher than d2, if

authority(u1, t1) > authority(u2, t1) + authority(u3, t1) + authority(u4, t1)

This result rests on the assumption that u1’s authority is greater than those of u2, u3 and u4 combined.

On the other hand, d2 would be ranked higher than d1 if the combined authorities of u2, u3 and u4 exceed that of u1. When a majority of users are suggesting something, their collective suggestion is treated as more valuable than that of an individual user or a small subset of users.

Future Enhancements

While calculating the affirmations, this algorithm currently treats all affirming users as equal, even though they may have varying authorities for the corresponding tag. As a future enhancement, I plan to incorporate the users’ authorities into the affirmation calculations as well.

This weekend I sat down experimenting with my project data to see if I could generate ‘related’ documents. At first, cosine similarity seemed very promising: the results looked convincingly similar, and I was overjoyed to have completed such a cool feature in about an hour’s time. But then it struck me, that usual feeling that something is wrong when everything works out too smoothly. I realized that cosine similarity alone was not sufficient for finding similar documents.

So, what is cosine similarity?

Those not acquainted with Term Vector Theory and Cosine similarity can read this article.

Why does cosine similarity fail to capture the whole picture?

Let us consider two documents A and B represented as term vectors. The cosine treats both vectors as unit vectors by normalizing them, giving you a measure of the angle between the two. It does provide an accurate measure of directional similarity, but with no regard to magnitude, and magnitude is an important factor when considering similarity.

For example, the cosine between a document in which ‘machine’ occurs 3 times and ‘learning’ 4 times and another document in which ‘machine’ occurs 300 times and ‘learning’ 400 times says that the two documents point in exactly the same direction. If magnitude (euclidean distance) were taken into account, the results would be quite different.
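A quick way to see this is to treat the two documents above as two-term frequency vectors; the helper functions here are just for illustration:

    import math

    def cosine(a, b):
        """Cosine of the angle between two term-frequency vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    def euclidean(a, b):
        """Straight-line distance between the two vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    doc_a = [3, 4]      # 'machine' x3,   'learning' x4
    doc_b = [300, 400]  # 'machine' x300, 'learning' x400

    print(cosine(doc_a, doc_b))     # 1.0   -> identical direction
    print(euclidean(doc_a, doc_b))  # 495.0 -> very different magnitude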

How do I get the accurate measure of similarity?

We have two factors at our disposal: the cosine, which tells us how similar in direction two documents are, and the euclidean distance, which gives us the magnitude of the difference between them. There are a number of ways you could combine the two into a single similarity measure; one is sketched below.
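Purely as an illustration (and a deliberately naive one, not a recommendation), one could scale the cosine by a penalty that shrinks as the euclidean distance grows, reusing the helpers from the previous snippet:

    def combined_similarity(a, b):
        """Cosine scaled down by magnitude difference; one arbitrary choice among many."""
        return cosine(a, b) / (1.0 + euclidean(a, b))

    print(combined_similarity(doc_a, doc_b))   # tiny: same direction, very different size
    print(combined_similarity(doc_a, [4, 3]))  # larger: close in angle and magnitude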

Conclusion

The magnitude and the cosine each capture a different aspect of the similarity between two entities. It is up to us to use them individually or in unison, depending upon our application’s needs.

[Update]

As pointed out by Dr. E. Garcia (in the comments), similarities can be expressed by cosines, dot products, Jaccard coefficients and in many other ways.