Thanks.

Interesting dialogue. There are some other very interesting points worth bearing in mind when working with these measures.

Cosine Similarity and Euclidean Distance capture a lot of the same information. However, whereas Euclidean Distance measures the actual distance between the two points of interest, Cosine can be thought of as measuring their apparent distance as viewed from the origin. Think of stars in the sky if you like analogies. The stars in Taurus are all relatively close together from our point of view. But in reality some of them are probably many times closer to us than others – the distance information has been effectively discarded (this is your normalization factor).
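To make the star analogy concrete, here is a minimal sketch in plain Python. The two "stars" and the helper names `cosine_similarity` and `euclidean_distance` are made up for illustration: both vectors point in exactly the same direction from the origin, but one is ten times farther away.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    # Straight-line distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical data: same direction as seen from the origin, different distance.
near_star = [1.0, 2.0, 3.0]
far_star = [10.0, 20.0, 30.0]

cos_sim = cosine_similarity(near_star, far_star)
dist = euclidean_distance(near_star, far_star)

print("cosine similarity:", cos_sim)   # maximal (1, up to rounding): same apparent position
print("euclidean distance:", dist)     # large: the points are actually far apart
```

Cosine sees the two as identical, while Euclidean sees them as far apart. Which answer is "right" depends on whether the length of the vector means anything in your application.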

If you’re sailing a ship and navigating by the stars, then this is an appropriate thing to do. If you’re navigating an interstellar spacecraft, it is probably not. And so it is with your application. Think of the meaning of the vectors with respect to your application and you will choose the most appropriate measure. But combining Cosine with Euclidean will probably not get you very far, as you are simply re-using a lot of information you already have.

Another way to think of Cosine Similarity is as measuring the *relative* proportions of the various features or dimensions – when all the dimensions between two vectors are in proportion (correlated), you get maximum similarity. Euclidean distance and its relatives (like Manhattan distance) are more concerned with absolutes. In practice though, the measures will often give similar results, especially on very high-dimensional data. By normalization, Cosine Similarity effectively reduces the dimensionality of your data by 1, so the higher the dimensionality of your data – all other things being equal – the less significant this difference becomes. All things *may not* be equal though, and it may be that the distance from the origin is a dimension of special significance – so again, check the logic of your app.
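One way to see how much information the two measures share: once both vectors are scaled to unit length, the squared Euclidean distance between them is exactly 2 × (1 − cosine similarity). The sketch below checks this on two arbitrary example vectors; the vectors and the helper names are assumptions for illustration only.

```python
import math

def normalize(v):
    # Scale a vector linearly so that its length is 1.
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Arbitrary example vectors.
a = [3.0, 1.0, 0.5]
b = [1.0, 4.0, 2.0]

ua, ub = normalize(a), normalize(b)

# Squared Euclidean distance between the unit-length versions.
sq_dist = sum((x - y) ** 2 for x, y in zip(ua, ub))
cos_sim = cosine_similarity(a, b)

print(sq_dist, 2 * (1 - cos_sim))  # the two numbers agree up to rounding
```

So on normalized data, ranking by Euclidean distance and ranking by Cosine Similarity give the same order. The only thing Euclidean adds on raw data is the distance-from-origin information.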

Another fact about Cosine Similarity that has been pointed out is that it is not really a measure of distance. This is really an issue of semantics. You can address this complaint to a large extent by squaring the value. The squared value always lies between 0 and 1, and it has the property that its complement (1 minus the value) is equal to the square of the *Sine* of the angle between the vectors – which is an equivalent measure of *dissimilarity*. You need to square the values for this relationship to hold. Doing this gives you something akin to a proportion of similarity or dissimilarity (when it is 0.5 you might – but probably shouldn’t – say that your points are neither particularly similar nor dissimilar). It is equivalent to taking the R² value when analysing correlations (which is usually the done thing when trying to do anything cleverer than just rank the data).
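A small check of the squaring relationship, as a sketch with two made-up 2-D vectors that sit 45 degrees apart. This is the "0.5" case mentioned above, where the squared similarity and the squared dissimilarity come out equal.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors at 45 degrees to each other.
a = [1.0, 0.0]
b = [1.0, 1.0]

cos_sim = cosine_similarity(a, b)

# Clamp before acos to guard against tiny floating-point overshoot.
angle = math.acos(max(-1.0, min(1.0, cos_sim)))

squared_similarity = cos_sim ** 2        # cos^2 of the angle
complement = 1.0 - squared_similarity    # "1 minus the value"
sine_squared = math.sin(angle) ** 2      # sin^2 of the same angle

print(squared_similarity, complement, sine_squared)
```

Without the squaring step, 1 − cos(angle) is not equal to sin(angle), so the similarity and dissimilarity would not add up in this tidy way.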

While we’re on the subject of interpreting values, if your vector components comprise probabilities which sum to 1 (e.g. normalized frequency counts, which represent the probability of finding any particular word at a random location in the document), then you can normalize the vector by simply converting all of the probabilities to their square roots. This will naturally give you a vector whose *length* is 1. Note that this is not the same as scaling the vector linearly as is often done (and as Cosine Similarity does), although that also results in a vector of length 1! I’ll leave you to work out whether that is a good thing or not. :-)
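Here is a quick sketch of the two normalizations side by side. The word counts are invented for illustration; the point is only that both routes end at length 1 but at different vectors.

```python
import math

def length(v):
    return math.sqrt(sum(x * x for x in v))

# Hypothetical word counts for a tiny document.
counts = {"the": 5, "cat": 3, "sat": 2}
total = sum(counts.values())

# Probabilities that sum to 1 (normalized frequency counts).
probs = [c / total for c in counts.values()]

# Route 1: take square roots. The squared components are the probabilities,
# which sum to 1, so the length is 1.
sqrt_vec = [math.sqrt(p) for p in probs]

# Route 2: linear scaling, which is what Cosine Similarity does implicitly.
probs_len = length(probs)
linear_vec = [p / probs_len for p in probs]

sqrt_len = length(sqrt_vec)
linear_len = length(linear_vec)

print(sqrt_vec, sqrt_len)
print(linear_vec, linear_len)
```

Both results have length 1, but the square-root version pulls the components closer together, so rare words carry relatively more weight than they do after linear scaling.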

Regards,

Ahmed