Archive for the Tagging Category

Although I started this project as an experimental weekend thingy (to play around with Google App Engine), it has shaped up well. Before you surf over to another blog, wondering what the hell I’m talking about, let me introduce you to the “Personalized ARTICLE” aggregator (read as PARTICLE). The aim is to personalize a user’s online reading (just as Findory did). Findory was an excellent service, and I’ll be glad if I can achieve even an iota of what Greg created. This project is at a very rudimentary and experimental stage. Rather than tapping into the user’s reading history on the site (monitored by the links clicked), the idea is to study how a user’s *interests*, scattered around various “databases of interest” like, could be used to personalize online reading (news articles, blogs and more). This way the user can merrily browse the world wide web, bookmarking pages and doing his usual stuff, and let PARTICLE worry about making this data useful.

Click here to try PARTICLE

Presently you need to provide PARTICLE with your username, which it uses to analyze your *interests* and present you with recent news stories you may like. It works well if you have a decent number of bookmarks in As I mentioned, the project is at a very rudimentary stage, so don’t feel disappointed by the results (ah! the unlucky few). I encourage you to play around with the app and recommend it to others to try. I’ll be making many changes/additions in the coming weeks.

Test drive PARTICLE at Kindly leave your feedback/comments/suggestions in the comments or send me an email at ‘anand at’.

[UPDATE] Yahoo! Research has a similar project called Garçon.

Search has always been an integral part of any tagging system. Such systems need to make sense of the abundant user-generated metadata so that the documents/items can be ranked in some order. However, very little has been said or written openly about ranking algorithms for tagging systems.

Conventional Methods

Most systems that allow tag search base their rankings on simple factors such as the ‘number of unique users’, or on ratios like ‘number of unique users for tag t / number of unique users for all tags’. These conventional algorithms do work, but not so well for large datasets, where they can be exploited. They also often fail to represent true relevance. It often reminds me of the pre-PageRank era of information retrieval systems.
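To make the two conventional measures concrete, here is a minimal Python sketch. The `(user, document, tag)` triple format, the function names and the sample data are all assumptions made for illustration, and the ratio is read here as being computed per document, which is only one possible interpretation of the formula quoted above.

```python
from collections import defaultdict

# Hypothetical sample data: (user, document, tag) triples.
SAMPLE_TAGGINGS = [
    ("alice", "doc1", "python"),
    ("bob", "doc1", "python"),
    ("carol", "doc1", "news"),
    ("alice", "doc2", "python"),
]

def unique_user_counts(taggings, tag):
    """Map each document to the number of distinct users who applied `tag`."""
    users_by_doc = defaultdict(set)
    for user, doc, t in taggings:
        if t == tag:
            users_by_doc[doc].add(user)
    return {doc: len(users) for doc, users in users_by_doc.items()}

def tag_share(taggings, doc, tag):
    """Unique users of `tag` on `doc` divided by unique users of any tag on `doc`."""
    users_for_tag = {u for u, d, t in taggings if d == doc and t == tag}
    users_for_any = {u for u, d, t in taggings if d == doc}
    return len(users_for_tag) / len(users_for_any) if users_for_any else 0.0

counts = unique_user_counts(SAMPLE_TAGGINGS, "python")
share = tag_share(SAMPLE_TAGGINGS, "doc1", "python")
```

Both measures count heads only, so a handful of throwaway accounts can push a document up the ranking. That is the weakness the rest of this post tries to address.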

So, which relevance algorithm do I use?

Well, you can always use the conventional methods, but you can also try the algorithm I devised. This algorithm seems to capture the true essence of relevance in tagging systems. I call it WisdomRank, as it is truly based on the ‘wisdom’ of the crowds, the fundamental part of any social system. Read on to understand it in detail (or download the pdf).

Inferring relevance for tag search from user authority – Abstract

Tagging is an act of imparting human knowledge/wisdom to objects. Thus a tag, a one-word interpretation/categorization of the object by the user, fundamentally represents the basic unit of human wisdom for any object. This wisdom is difficult to quantify, as it is relative for every user. One approach to quantifying it is to use the wisdom of the other users to define it for us. This can be done by assuming that every tag corresponds to a topic for which every user has some authority. Also, every tag added to an object corresponds to a vote, similar to the Digg model, asserting that the object belongs to that topic (tag).

Let us consider a user Ui who has tagged object Oj with the tag Tk. Whenever other users in the system tag Oj with Tk, they are implicitly affirming Ui’s wisdom for tag Tk.

Thus, we define the function affirmation for the tuple(u, d, t) as the number of other users who have also tagged document d with tag t:

$$\mathrm{affirmation}(u, d, t) = \sum_{u_i \neq u} \mathrm{tagged}(u_i, d, t)$$

where

u – the user
d – the document/object
t – the tag
tagged(u_i, d, t) – 1 if the user u_i has tagged d with t, 0 otherwise
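The affirmation function can be sketched in a few lines of Python. This is only an illustration: storing taggings as a set of `(user, document, tag)` triples and the sample data below are assumptions, not something the algorithm prescribes.

```python
# Hypothetical toy data: a set of (user, document, tag) triples.
SAMPLE = {
    ("u1", "d1", "t1"),
    ("u2", "d1", "t1"),
    ("u3", "d1", "t1"),
    ("u1", "d2", "t1"),
}

def tagged(taggings, user, doc, tag):
    """Return 1 if `user` has tagged `doc` with `tag`, 0 otherwise."""
    return 1 if (user, doc, tag) in taggings else 0

def affirmation(taggings, user, doc, tag):
    """Count the other users who have also tagged `doc` with `tag`."""
    all_users = {u for u, _, _ in taggings}
    return sum(tagged(taggings, other, doc, tag)
               for other in all_users if other != user)

on_d1 = affirmation(SAMPLE, "u1", "d1", "t1")  # u2 and u3 agree with u1
on_d2 = affirmation(SAMPLE, "u1", "d2", "t1")  # nobody else used t1 on d2
```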

Hence, we can proceed to define the wisdom of the user for a topic (tag) t as the sum of all such affirmations by other users,

$$\mathrm{wisdom}(u, t) = \sum_{d \in D(u, t)} \mathrm{affirmation}(u, d, t)$$

where D(u, t) is the set of all documents that u has tagged with t.

Likewise, we can now define the authority of a user for the topic t, as the ratio of the user’s wisdom to the collective wisdom for t. Hence,

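$$\mathrm{authority}(u, t) = \frac{\mathrm{wisdom}(u, t)}{\sum_{i} \mathrm{wisdom}(u_i, t)}$$

where the sum in the denominator runs over all users u_i.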

For example: Let us determine the authority of user u1 for tag t1

Object d1:      Object d2:      Object d3:
t1 by u1        t1 by u1        t1 by u2
t1 by u2        t3 by u1        t1 by u3
t1 by u3
t3 by u1
t2 by u1

affirmation(u1, d1, t1) = 2
affirmation(u1, d2, t1) = 0
Hence, wisdom(u1, t1) = 2 + 0 = 2

Likewise for other users,

wisdom(u2, t1) = 3
wisdom(u3, t1) = 3

Hence the authority of user u1 for t1 is as follows:

authority(u1, t1) = 2 / (2 + 3 + 3) = 2 / 8 = 0.25
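The worked example can be checked with a short Python sketch. The set-of-triples representation is an assumption made for illustration, and only the t1 taggings are included, since the t2 and t3 entries do not affect the authority for t1.

```python
# The t1 taggings from the example, as (user, document, tag) triples.
TAGGINGS = {
    ("u1", "d1", "t1"), ("u2", "d1", "t1"), ("u3", "d1", "t1"),
    ("u1", "d2", "t1"),
    ("u2", "d3", "t1"), ("u3", "d3", "t1"),
}

def affirmation(taggings, user, doc, tag):
    """Number of other users who also tagged `doc` with `tag`."""
    return sum(1 for u, d, t in taggings
               if d == doc and t == tag and u != user)

def wisdom(taggings, user, tag):
    """Sum of affirmations over every document `user` tagged with `tag`."""
    docs = {d for u, d, t in taggings if u == user and t == tag}
    return sum(affirmation(taggings, user, d, tag) for d in docs)

def authority(taggings, user, tag):
    """The user's share of the collective wisdom for `tag`."""
    users = {u for u, _, _ in taggings}
    total = sum(wisdom(taggings, u, tag) for u in users)
    return wisdom(taggings, user, tag) / total if total else 0.0

wisdoms = {u: wisdom(TAGGINGS, u, "t1") for u in ("u1", "u2", "u3")}
u1_authority = authority(TAGGINGS, "u1", "t1")
```

Running this gives a wisdom of 2, 3 and 3 for u1, u2 and u3, and an authority of 0.25 for u1, in line with the figures above.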

Whenever a user tags an object with a tag, he does so with the authority he possesses for that tag. Thus as compared to conventional methods, where the objects are usually ranked on the number of instances of the tags, in this method the measure of the relevance of a tag for an object is equivalent to the sum of all such user authorities. Thus,

$$\mathrm{relevance\_metric}(d, t) = \sum_{u_i \in U(d, t)} \mathrm{authority}(u_i, t)$$

where U(d, t) is the set of all users who have tagged document d with t.

This relevance score, when calculated for every tag, would provide an accurate measure for ranking the objects. In the conventional methods, a larger number of instances of a tag on an object ensures a higher relevance for that tag; here, it is the number of authoritative users that counts.
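Putting the pieces together, the relevance metric might be computed as in the sketch below. The single-pass layout, the helper names and the sample data (the t1 taggings from the earlier three-object example) are choices made for illustration; the definitions themselves are the ones given above.

```python
from collections import Counter, defaultdict

# The t1 taggings from the earlier example, as (user, document, tag) triples.
TAGGINGS = {
    ("u1", "d1", "t1"), ("u2", "d1", "t1"), ("u3", "d1", "t1"),
    ("u1", "d2", "t1"),
    ("u2", "d3", "t1"), ("u3", "d3", "t1"),
}

def authorities(taggings, tag):
    """Return {user: authority} for every user who has used `tag`."""
    users_by_doc = defaultdict(set)
    for u, d, t in taggings:
        if t == tag:
            users_by_doc[d].add(u)
    wisdom = Counter()
    for users in users_by_doc.values():
        for u in users:
            # affirmation(u, d, t): everyone else who put `tag` on this document
            wisdom[u] += len(users) - 1
    total = sum(wisdom.values())
    return {u: (w / total if total else 0.0) for u, w in wisdom.items()}

def relevance_scores(taggings, tag):
    """Return {document: sum of the authorities of its taggers} for `tag`."""
    auth = authorities(taggings, tag)
    scores = defaultdict(float)
    for u, d, t in set(taggings):
        if t == tag:
            scores[d] += auth.get(u, 0.0)
    return dict(scores)

def rank_documents(taggings, tag):
    """Documents carrying `tag`, most relevant first."""
    scores = relevance_scores(taggings, tag)
    return sorted(scores, key=scores.get, reverse=True)

scores = relevance_scores(TAGGINGS, "t1")
ranking = rank_documents(TAGGINGS, "t1")
```

On this data d1 scores 1.0, d3 scores 0.75 and d2 scores 0.25, so a search for t1 would list d1, d3, d2.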

Let us consider the following example:

Object d1:      Object d2:
t1 by u1        t1 by u2
t2 by u5        t1 by u3
                t1 by u4

Let us assume that u1 has a very high authority for tag t1. Hence in the above scenario, a search for tag t1 may rank d1 higher than d2, if

authority(u1, t1) > authority(u2, t1) + authority(u3, t1) + authority(u4, t1)

In other words, d1 outranks d2 only when u1’s authority is greater than those of u2, u3 and u4 combined.

On the other hand, d2 would be ranked higher than d1 if the combined authorities of u2, u3 and u4 exceed that of u1. If the majority of the users are suggesting something, it indicates that their suggestion is far more valuable than that of an individual user or a subset of users.
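The two outcomes can be illustrated with a small Python sketch. The authority values below are invented purely for the illustration, and `rank_for_tag` is a hypothetical helper name.

```python
# Who tagged each object with t1 in the example above.
T1_TAGGERS = {
    "d1": {"u1"},
    "d2": {"u2", "u3", "u4"},
}

# Invented authority values for t1 (assumptions, not computed from real data).
HIGH_U1 = {"u1": 0.6, "u2": 0.1, "u3": 0.1, "u4": 0.1}
LOW_U1 = {"u1": 0.2, "u2": 0.1, "u3": 0.1, "u4": 0.1}

def rank_for_tag(taggers_by_doc, authority):
    """Order documents by the summed authority of the users who tagged them."""
    scores = {
        doc: sum(authority.get(user, 0.0) for user in users)
        for doc, users in taggers_by_doc.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

when_u1_dominates = rank_for_tag(T1_TAGGERS, HIGH_U1)   # d1 first
when_crowd_dominates = rank_for_tag(T1_TAGGERS, LOW_U1)  # d2 first
```

With the first set of values a single authoritative user outweighs three others; with the second, the three users together carry more weight and d2 moves to the top.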

Future Enhancements

When calculating affirmations, this algorithm currently treats all affirming users as equal, even though they may have varying authorities for the corresponding tag. As a future enhancement, I plan to incorporate the authorities of the users into the affirmation calculations as well.
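One possible reading of this enhancement, offered only as a sketch and not as the final design, is to let each affirming user contribute their own authority instead of a flat 1. The authorities are assumed here to come from a first, unweighted pass; the function name and data layout are hypothetical.

```python
# The t1 taggings from the earlier three-object example.
TAGGINGS = {
    ("u1", "d1", "t1"), ("u2", "d1", "t1"), ("u3", "d1", "t1"),
    ("u1", "d2", "t1"),
    ("u2", "d3", "t1"), ("u3", "d3", "t1"),
}

# Authorities for t1 from the unweighted calculation (2/8, 3/8, 3/8).
AUTHORITY_T1 = {"u1": 0.25, "u2": 0.375, "u3": 0.375}

def weighted_affirmation(taggings, authority, user, doc, tag):
    """Sum the authorities of the other users who also tagged `doc` with `tag`.

    Assumption: `authority` maps each user to their authority for `tag`,
    taken from an earlier unweighted pass.
    """
    return sum(authority.get(u, 0.0) for u, d, t in taggings
               if d == doc and t == tag and u != user)

u1_on_d1 = weighted_affirmation(TAGGINGS, AUTHORITY_T1, "u1", "d1", "t1")
u2_on_d1 = weighted_affirmation(TAGGINGS, AUTHORITY_T1, "u2", "d1", "t1")
```

Here u1 is affirmed on d1 by two users with an authority of 0.375 each, giving 0.75, while u2 is affirmed by u1 and u3, giving 0.625. Whether the calculation should then be repeated with the updated authorities is left open.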

Interpreting Bookmarking

July 11th, 2006

In a Social bookmarking system, users store lists of Internet resources, which they find useful. They also categorize their resources by the use of informally assigned, user-defined keywords or tags. [via Wikipedia]

One way of interpreting bookmarking systems is as above, i.e. simply as a list of ‘resources which they find useful’. If we consider bookmarking systems like, as per the above the bookmarks are just lists of URLs which the user finds useful.

The same, viewed from another perspective, could be interpreted as ‘resources which they find interesting’. Hence a bookmarking system behaves as an ‘interest management system’.

Yet another perspective, which is relevant to search engines, is the ‘history of visited pages labelled by keywords’. This gives search engines information about the likelihood of a user clicking on a result (the bookmarked link) for specific keywords (tags), something intrinsic to personalized search.

These are the various perspectives I intend to exploit with the tool I’ve started developing (read about the core concept here). I will be developing this tool initially on Simpy, as Otis Gospodnetic (author of the LIA book and the one behind Simpy) has offered his valuable support.

In the meantime you can ponder other ways of looking at bookmarking, and possibly list them here as well.

What happens when you cross Tagging with Machine Learning? Well you get a tool that learns as you tag. Sounds interesting? Then read on.

Learns? But what?

That’s what I asked myself yesterday. What I visualised was a tool that learns from you how to tag and what to tag. It would learn from past experience what you would like to tag, and also which keywords (tags) you would likely use. And finally one day, when it has a large enough knowledge base, it could probably automate the entire task for you.


Imagine having your own personal crawler, spidering the web in search of pages that might interest you and even saving the most likely ones. Imagine coming to the office and seeing your toRead list already populated by the bot.

Sounds too optimistic? Well I’ll give it a try. Until then you’ll have to do what humans do best – tagging – on your own.