Archive for February, 2007

This weekend I sat down to experiment with my project data and see if I could generate ‘related’ documents. At first, cosine similarity seemed very promising. The results seemed awfully similar and I was overjoyed to have completed such a cool feature in about an hour's time. But then it struck me: the usual feeling you get that something is wrong when everything is working out smoothly. I realized that cosine similarity alone was not sufficient for finding similar documents.

So, what is cosine similarity?

Those not acquainted with Term Vector Theory and cosine similarity can read this article.

Why does cosine similarity fail to capture the whole picture?

Let us consider two documents A and B represented by the vectors in the above figure. The cosine treats both vectors as unit vectors by normalizing them, giving you a measure of the angle between the two vectors. It is an accurate measure of how closely the two vectors are aligned, but it pays no regard to magnitude. And magnitude is an important factor when considering similarity.

For example, take a document in which ‘machine’ occurs 3 times and ‘learning’ 4 times, and another in which ‘machine’ occurs 300 times and ‘learning’ 400 times. The cosine between them is 1, which says that the two documents point in exactly the same direction, even though one uses the terms a hundred times more often than the other. If magnitude (Euclidean distance) were taken into account, the results would be quite different.
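The numbers in this example can be checked with a short Python sketch. It assumes each document is represented as a plain list of raw term counts in the order (‘machine’, ‘learning’); the function names are illustrative and not taken from any particular library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two term-count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Term counts for ('machine', 'learning') in each document.
doc_a = [3, 4]
doc_b = [300, 400]

cos_ab = cosine_similarity(doc_a, doc_b)    # the vectors are parallel
dist_ab = euclidean_distance(doc_a, doc_b)  # yet they are far apart

print(cos_ab, dist_ab)
```

The cosine comes out as 1 because one vector is just the other scaled by 100, while the distance between the two is 495.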

How do I get the accurate measure of similarity?

We have two factors at our disposal: the cosine, which tells us how closely two documents are aligned, and the (Euclidean) distance, which gives us the magnitude of the difference between them. There are a number of ways to combine the two into a single similarity measure.
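As one purely illustrative way of combining the two (an assumption made for this sketch, not a formula from this post), the distance can be turned into a score between 0 and 1 with 1 / (1 + distance) and then blended with the cosine using a weight. The function name combined_similarity and the parameter alpha are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def combined_similarity(a, b, alpha=0.5):
    """Hypothetical blend of direction and magnitude.

    alpha weights the cosine; (1 - alpha) weights a distance-based
    score that is 1 for identical vectors and approaches 0 as the
    vectors move apart.
    """
    distance_score = 1.0 / (1.0 + euclidean_distance(a, b))
    return alpha * cosine_similarity(a, b) + (1.0 - alpha) * distance_score

same = combined_similarity([3, 4], [3, 4])         # identical documents
scaled = combined_similarity([3, 4], [300, 400])   # same direction, far apart

print(same, scaled)
```

With this choice, two identical documents score 1, while the 3/4 versus 300/400 pair scores lower even though its cosine is also 1. Other combinations, such as multiplying the two scores, are equally plausible; which one works best depends on the data.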

Conclusion

The magnitude and the cosine each capture a different aspect of similarity between two entities. It is up to us to use them either individually or in unison (as above), depending on the needs of our application.

[Update]

As pointed out by Dr. E. Garcia (in the comments), similarities can be expressed by cosines, dot products, Jaccard’s coefficients and in many other ways.
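To make two of these alternatives concrete, here is a minimal Python sketch. It assumes documents are stored as dictionaries mapping terms to counts, that the Jaccard coefficient is computed on the sets of distinct terms, and that two empty documents score 0 by convention; the sample documents and function names are made up for illustration.

```python
def dot_product(a, b):
    """Sum of count products over the terms the two documents share."""
    return sum(count * b[term] for term, count in a.items() if term in b)

def jaccard_coefficient(a, b):
    """Shared distinct terms divided by all distinct terms."""
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    if not union:
        return 0.0  # convention chosen here for two empty documents
    return len(terms_a & terms_b) / len(union)

doc_a = {"machine": 3, "learning": 4}
doc_b = {"machine": 300, "learning": 400, "vector": 1}

dot_ab = dot_product(doc_a, doc_b)
jaccard_ab = jaccard_coefficient(doc_a, doc_b)

print(dot_ab, jaccard_ab)
```

The dot product grows with the raw counts, whereas this form of the Jaccard coefficient ignores counts entirely and looks only at which terms overlap.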