Similarity Measure: Cosine Similarity or Euclidean Distance or Both
This weekend I sat down experimenting with my project data to see if I could generate ‘related‘ documents. At first, the cosine similarity seemed very promising. The results seemed awfully similar and I was overjoyed to have completed such a cool feature in about an hours time. But then it struck me, the usual feeling you get that something is wrong when everything is working out smoothly. I realized that cosine similarity alone was not sufficient for finding similar documents.
So, what is cosine similarity?
Those not acquainted with Term Vector Theory and Cosine similarity can read this article.
Why does cosine similarity fail to capture the whole picture?

Let us consider two documents A and B represented by the vectors in the above figure. The cosine treats both vectors as unit vectors by normalizing them, giving you a measure of the angle between the two vectors. It does provide an accurate measure of similarity but with no regard to magnitude. But magnitude is an important factor while considering similarity.
For example, the cosine between a document which has ‘machine’ occurring 3 times and ‘learning’ 4 times and another document which has ‘machine’ occurring 300 times and ‘learning’ 400 times will hint that the two documents are pointing in almost the same direction. If magnitude (euclidean distance) was taken into account, the results would be quite different.
How do I get the accurate measure of similarity?
We have at our disposal two factors: one the cosine which gives us a measure of how similar two documents are, and the second the (euclidean) distance which gives us the magnitude of difference between the two documents. There could be a number of ways you could combine the two to determine the similarity measure.
Conclusion
The magnitude and cosine both provide us with a different aspect of similarity between two entities. It is upto us to either use them individually or in unison (as above) depending upon our application needs.
[Update]
As pointed out be Dr. E. Garcia (in the comments), similarities can be expressed by cosines, dot products, Jaccard’s Coefficients and in many other ways.
February 27th, 2007 at 6:40 am
Correction.
Somehow you are mistaken the definition for cosine similarity.
Cosine-based similarity values already incorporates the magnitudes (lengths) of the vectors. Let q be a query and d a document then by definition
sim(q,d) = cosine theta = Q*D/(| Q |*| D |)
where Q*D = dot product between q and d
| Q | = magnitude of query vector
| D | = magnitude of document vector
If you prenormalize all vectors (docs and q) before doing the calculations then the dot product between unit vectors equals the cosine angle since vector lengths will be unit lengths.
Regarding similarity. There are many ways to define similarity in IR. The most common is using cosines, but this is not the only way. Similarity can be defined using Jaccard’s Coefficients, dot products, etc, even weight products are often taken for similarity estimates. A typical example of the later is an LSI scheme wherein entries of a term-doc matrix are weights defined as occurrences (term frequencies). When using other scoring schemes with LSI, we no longer talk about occurrences, but about weights, so the similarity is taken for weight products.
Regards
Dr. E. Garcia
February 27th, 2007 at 5:10 pm
Hi Dr. E. Garcia,
>> “If you prenormalize all vectors (docs and q) before doing the
>> calculations then the dot product between unit vectors equals the
>> cosine angle since vector lengths will be unit lengths.”
This is exactly what I tried to explain in my post. Let me explain this with an example:
Let us consider two documents represented by the vectors A and B.
|A| = 5 |B| = 4
Thus, dot-product of A and B is as follows:
A.B = |A| |B| cos(theta)
= 5 * 4 * cos(theta)
= 20 * cos(theta)
Hence, in order to find the cosine we need to divide the above by |A| |B|. This reduces to:
A.B / |A||B| = 20 * cos(theta) / |A||B|
= 20 * cos(theta) / 20
= cos(theta)
Now, what I wanted to say was that cosine between two vectors is equivalent to the cosine between two such unit vectors. Consider two unit vectors Ua and Ub with the same theta as above. Then,
Ua.Ub = |Ua| |Ub| cos(theta)
= 1 * cos(theta)
= cos(theta)
Hence, in order to find the cosine we need to divide the above by |Ua| |Ub|. This reduces to:
Ua.Ub / |Ua||Ub| = 1 * cos(theta) / |Ua||Ub|
= 1 * cos(theta) / 1
= cos(theta)
This proves that cosine doesn’t account for the magnitude of the vectors. This can be accounted for by taking the euclidean distance between the vectors.
Hopefully this is sufficient to ascertain my claim that cosine alone does not suffice as a good similarity measure.
February 28th, 2007 at 6:41 am
No. This is not the correct way of computing cosine similarities.
Your equation
Sim = cosine(Θ) / norm(M)
is simply incorrect. It should be
sim = cosine(Θ) = dot product/norm product. When we compute cosine angles by definition we divide a dot product by the length of the vectors involved.
In addition in the example above you to know a pair of coordinates for each document A and B. Let say these are A(x1, y1) and B(x2, y2)
Then the dot product is
A*B = x1*x2 + y1*y2
the norm of each vector (their length in this case) is
|A|= (x1*x1 + y1*y1)^1/2
|B| = (x2*x2 + y2*y2)^1/2
the length product (norm product) is
| A |*| B |
The cosine similarity, sim, IS the cosine angle:
sim = cosine(theta) = A*B/(|A|*|B|)
So, to compute the cosine angle we must divided by the length by definition. This cosine angle IS the similarity score. Therefore, your equation
Sim = cosine(Θ) / norm(M)
is incorrect.
February 28th, 2007 at 10:43 am
What I have tried above is to suggest a function that incorporates both cosine (numerator) and euclidean distance (denominator). It is not an attempt to derive another formula for cosine altogether. The cosine formula remains unchanged, but after determining the cosine and euclidean, individually by their respective formulas, we can combine to two to get a variable that accounts for both the similarity (represented by cosine) and the magnitude of difference (euclidean).
February 28th, 2007 at 9:06 pm
The equation for computing cosine angles already incorporates the euclidean distance of the vectors in the denominator.
In linear algebra this is actually the magnitude of the vectors, also called their norm. There is no need for normalizing the cosine angle since this is already normalized.
In IR similarity is often expressed as a cosine angle since this precisely accounts for the dot products and euclidean distances. There is a reason for this: the cosine angle of data point are equivalent to the product moment correlation coefficient found in statistics books.
Similarity is not a metric, a distance is. One can convert a similarity into distances and viceversa, thought not that obvious because of the triangular inequality that must be meet by a distance metric. Expressing similarity as cosines only to next divide the cosine angle once again by the distance (magnitude) of the vectors goes against this cardinal relationships.
Similarities can be expressed by cosines, dot products, Jaccard’s Coefficients and in many other ways. But dividing a cosine similarity once again by a distance metric (euclidean in this case) results in a quantity with no significance for IR purposes whatsoever since then it will not be possible to make the conversion.
If you want to incorporate proximity between any two points (the relative neighborhood), this is already done in IR via scalar clustering theory.
Here you just need to express entries of a mxm matrix as Jaccard’s Coefficients and store these in a new matrix. Then from this matrix compute cosine similarities. The new cosine similarity scores then will incorporate the induced effect of neighboring points.
Dr. E. Garcia
February 28th, 2007 at 9:16 pm
I get your point. Thanks. I’ll modify my post accordingly.
March 13th, 2007 at 1:09 pm
Hello,
Got a clear understanding of calcualtion of the similarity of the documents. Looks like a lot of people have implemented this cosine stuff. Even I’m going to implement the same for finding the document similarity. But I wonder, how should it be implemented in an efficient way. If I use a term document matrix and just the presence/absence as the matrix values, the matrix size wil be Dn*Tm. I feel processing such a huge matrix would be greatly inefficient. Any pointers in this direction??
Thanks
Sandhya
March 16th, 2007 at 10:49 am
Since you intent to store presence/absence values, it would be more memory efficient to store values for only terms that are present. If
|Tm| = the maximum number of words in any document
then,
if a document contains only say 20 words, you would still store absence values for |Tm|-20 words.
I computed similarity for documents indexed in Lucene [http://lucene.apache.org/]. Lucene stores the term vector for every document in the index itself.
If you do need to store a Dn*Tm matrix (with absence/presence values), assuming that a presence value is represented by 1 and an absence by 0, you could take the vectors of the two documents under consideration and ‘AND’ them. That would help you ease out the DOT product calculation.
April 20th, 2007 at 7:48 pm
Dr. E. Garcia
You just need to test your math using real numbers to know that you are indeed in correct.
A = {1, 1, 1}
B = {2, 2, 2}
Using your definition of similarity [which by the way is the right definition], you get:
Dot Product:
A*B = x1*x2 + y1*y2 + z1*z2 = 1*2+1*2+1*2 = 6
the norm of each vector (their length in this case) is
|A|= (x1*x1 + y1*y1)^1/2 = (1+1+1)^1/2 = 1.732050807569
|B| = (x2*x2 + y2*y2)^1/2 = (4+4+4)^1/2 = 3.464101615138
|A|*|B| = 5.999999
sim = cosine(theta) = A*B/(|A|*|B|) = 6/5.99999 which is basically 1!!!
Therefore, I cannot but conclude that you have missed the point of the original post.
April 20th, 2007 at 8:04 pm
site owner,
I question your wisdom in worshiping google.
Google is the best of what we have today. It is so far from great that it isn’t even funny.
it is like worshiping yahoo in 1995, or lycos in 1998.
May 14th, 2007 at 6:21 pm
The example you give, of course will give a value of 1 in the cosine similarity scale. It is not surprising.
Can you tell what I am missing? I standby about what I have stated. The original post was about defining similarity.
As mentioned, there are many ways of defining similarity. One way is using just dot products, another using cosine angles. Even we can express it as Jaccard’s Coefficient.
This is what we teach at graduate school. I recommend you to get a good book on data mining.
The original poster presented an incorrect definition for similarity and later kindly recognized this.
September 23rd, 2007 at 5:47 pm
I am new in this field of mining and was confused about a thing after reading all the posts. If cosine measure is a distance measure and euclidean measure is a magnitude measure then can any mathematical relationship exists between the two?
April 25th, 2008 at 2:27 pm
If vectors X and Y are L2 normalized, it is possible to prove that using the cosine distance (1 - cosine similarity) is interchangeable with the euclidean distance.
May 16th, 2008 at 1:44 am
Hi Anand and Dr. E.Garcia,
May I quote the original post:
“For example, the cosine between a document which has ‘machine’ occurring 3 times and ‘learning’ 4 times and another document which has ‘machine’ occurring 300 times and ‘learning’ 400 times will hint that the two documents are pointing in almost the same direction. If magnitude (euclidean distance) was taken into account, the results would be quite different.”
I’m kind of getting what you (Anand) intended to say from the early post, which is the cosine similarity is kind of normalizing the magnitude of the vector in its calculation. For example the coordinates A(3,4), B(5,12) and B’(500,1200); the cosine similarity between A and B, and A and B’ are the same; however, the euclidean distance is of course different.
In a context of query and document searhing, the larger the document tends to scales up (in ideal case while it may not always practical; scales up the vector dimension equally, so although the magnitude i.e. the size of the document increases, the vector direction remains the same). In this case, cosine similarity may have a desirable normalizing effect compared to using the euclidean distance, which will tends to favor document with similar scale (as well as the direction) from the query. In such application, we may expect that the magnitude of query can be much less than the document searched - thus using euclidean will tend to favor smaller documents.
Perhaps if the application is different, such as comparing the strict similarity between two documents, euclidean distance may have advantage if we intend to account similar magnitude of both documents.
June 12th, 2008 at 6:20 am
I’m not an expert but I’m not sure what point Dr. E. Garcia is trying to make here…cosine similarity does use magnitudes in the calculation but the calculation in fact normalizes the magnitudes, and the output of that calculation is an angle, and nothing else.
June 14th, 2008 at 10:05 pm
Jack: Well she was making the point that the two cannot be combined together to give you a number that takes both similarity and magnitude into account.
July 3rd, 2008 at 10:47 am
To all above:
You cannot blindfold exchange distance and similarity without knowing what is the model and the subject of that model.
http://www.miislita.com/searchito/binary-similarity-calculator.html
September 25th, 2008 at 1:35 am
Dear All
As you know, cosine similarity measure loses the word order - while is exploited for measuring the similarity of texts. What other methods exist that integrate some syntactic information like word order in measuring the similarity?
Thanks very much
October 30th, 2008 at 4:57 pm
Anand: I’m a he, not a she.
Hooman: With a positional inverted index, word orden and positional information can be stored into vectors and all kind of similarity scores computed.
BTW, at http://irthoughts.wordpress.com/2008/10/29/similarity-pearson-and-spearman-coefficients/ we show the connection between Pearson and Spearman’s Correlation Coefficients with cosine angles and dot products. None of these are distances. The links listed in that post might help to clear up the difference between similarity and distance.
Another thing. Few IRs have, unfortunately, used the expression ‘Similarity Distance’. Avoid it. This expression is an oxymoron as Distance is Dissimilarity.
I hope this help.