<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Similarity Measure: Cosine Similarity or Euclidean Distance or Both</title>
	<atom:link href="http://semanticvoid.com/blog/index.php/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/feed/" rel="self" type="application/rss+xml" />
	<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/</link>
	<description>Extracting the semantics from the void</description>
	<lastBuildDate>Mon, 08 Mar 2010 18:25:14 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Richard</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/comment-page-1/#comment-121827</link>
		<dc:creator>Richard</dc:creator>
		<pubDate>Mon, 04 Jan 2010 06:48:05 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-121827</guid>
		<description>Hi All,
Thanks for providing this great tutorial. I am new and have just started learning about data mining. I am sorry if my question sounds childish. I know how to calculate tf/idf. Thanks to egarcia, I now also know how to calculate cosine similarity. But I have a question.
Let&#039;s say I have two documents. Each document has 2 words. 1 word is the same in both documents. So the tf/idf vectors would be like doc A (0,0.5), doc B (0,0.5). Should I calculate cosine similarity on the tf/idf values? If I do, it returns 1, while I can see that the documents are 50% the same, not 100%.
Am I missing something? Should I calculate cosine similarity only on the IDF values or the TF values?
Thanks</description>
		<content:encoded><![CDATA[<p>Hi All,<br />
Thanks for providing this great tutorial. I am new and have just started learning about data mining. I am sorry if my question sounds childish. I know how to calculate tf/idf. Thanks to egarcia, I now also know how to calculate cosine similarity. But I have a question.<br />
Let&#8217;s say I have two documents. Each document has 2 words. 1 word is the same in both documents. So the tf/idf vectors would be like doc A (0,0.5), doc B (0,0.5). Should I calculate cosine similarity on the tf/idf values? If I do, it returns 1, while I can see that the documents are 50% the same, not 100%.<br />
Am I missing something? Should I calculate cosine similarity only on the IDF values or the TF values?<br />
Thanks</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: egarcia</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/comment-page-1/#comment-31664</link>
		<dc:creator>egarcia</dc:creator>
		<pubDate>Thu, 30 Oct 2008 23:57:04 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-31664</guid>
		<description>Anand: I&#039;m a he, not a she.

Hooman: With a positional inverted index, word order and positional information can be stored in vectors and all kinds of similarity scores computed.

BTW, at http://irthoughts.wordpress.com/2008/10/29/similarity-pearson-and-spearman-coefficients/ we show the connection between Pearson and Spearman&#039;s Correlation Coefficients with cosine angles and dot products. None of these are distances. The links listed in that post might help to clear up the difference between similarity and distance.

Another thing. A few IR researchers have, unfortunately, used the expression &#039;Similarity Distance&#039;. Avoid it. This expression is an oxymoron, as Distance is Dissimilarity.

I hope this helps.</description>
		<content:encoded><![CDATA[<p>Anand: I&#8217;m a he, not a she.</p>
<p>Hooman: With a positional inverted index, word order and positional information can be stored in vectors and all kinds of similarity scores computed.</p>
<p>BTW, at <a href="http://irthoughts.wordpress.com/2008/10/29/similarity-pearson-and-spearman-coefficients/" rel="nofollow">http://irthoughts.wordpress.com/2008/10/29/similarity-pearson-and-spearman-coefficients/</a> we show the connection between Pearson and Spearman&#8217;s Correlation Coefficients with cosine angles and dot products. None of these are distances. The links listed in that post might help to clear up the difference between similarity and distance.</p>
<p>Another thing. A few IR researchers have, unfortunately, used the expression &#8216;Similarity Distance&#8217;. Avoid it. This expression is an oxymoron, as Distance is Dissimilarity.</p>
<p>I hope this helps.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Hooman</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/comment-page-1/#comment-29873</link>
		<dc:creator>Hooman</dc:creator>
		<pubDate>Thu, 25 Sep 2008 08:35:49 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-29873</guid>
		<description>Dear All
As you know, the cosine similarity measure loses word order, even though it is used for measuring the similarity of texts. What other methods exist that integrate some syntactic information, such as word order, into measuring similarity?
Thanks very much</description>
		<content:encoded><![CDATA[<p>Dear All<br />
As you know, the cosine similarity measure loses word order, even though it is used for measuring the similarity of texts. What other methods exist that integrate some syntactic information, such as word order, into measuring similarity?<br />
Thanks very much</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: James</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/comment-page-1/#comment-25297</link>
		<dc:creator>James</dc:creator>
		<pubDate>Mon, 21 Jul 2008 05:05:50 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-25297</guid>
		<description>For more discussion see:
http://www.pui.ch/phred/automated_tag_clustering/</description>
		<content:encoded><![CDATA[<p>For more discussion see:<br />
<a href="http://www.pui.ch/phred/automated_tag_clustering/" rel="nofollow">http://www.pui.ch/phred/automated_tag_clustering/</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: James</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/comment-page-1/#comment-25296</link>
		<dc:creator>James</dc:creator>
		<pubDate>Mon, 21 Jul 2008 04:52:48 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-25296</guid>
		<description>The first document has only 3/300, or 1 percent, of the number of mentions of the word &quot;machine&quot;, and likewise only 4/400, or 1 percent, of the number of mentions of the word &quot;learning&quot;, compared to the second document. This small frequency is within the margin of error of most statistical measurements, and therefore I recommend that any supposed &quot;similarity&quot; be expressed as a probability. To see my observation in a different light, imagine a triangle with only its 3 points or vertices expressed. Now imagine a triangle with 300 points expressed, filling out the 3 sides. Now introduce the same random point into both sets. In the first set, the &quot;triangle&quot; will most likely become a square or rhombus (quite a different classification), while the random point in the second set doesn&#039;t greatly disturb our picture: it remains a triangle with some &quot;noise&quot;. Consequently, it would make sense to limit so-called &quot;comparisons&quot; or &quot;similarity&quot; to sets with the same order of magnitude of points to preserve statistical accuracy.</description>
		<content:encoded><![CDATA[<p>The first document has only 3/300, or 1 percent, of the number of mentions of the word &#8220;machine&#8221;, and likewise only 4/400, or 1 percent, of the number of mentions of the word &#8220;learning&#8221;, compared to the second document. This small frequency is within the margin of error of most statistical measurements, and therefore I recommend that any supposed &#8220;similarity&#8221; be expressed as a probability. To see my observation in a different light, imagine a triangle with only its 3 points or vertices expressed. Now imagine a triangle with 300 points expressed, filling out the 3 sides. Now introduce the same random point into both sets. In the first set, the &#8220;triangle&#8221; will most likely become a square or rhombus (quite a different classification), while the random point in the second set doesn&#8217;t greatly disturb our picture: it remains a triangle with some &#8220;noise&#8221;. Consequently, it would make sense to limit so-called &#8220;comparisons&#8221; or &#8220;similarity&#8221; to sets with the same order of magnitude of points to preserve statistical accuracy.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: e.garcia</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/comment-page-1/#comment-24353</link>
		<dc:creator>e.garcia</dc:creator>
		<pubDate>Thu, 03 Jul 2008 17:47:02 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-24353</guid>
		<description>To all above:

You cannot blindly interchange distance and similarity without knowing what the model is and what the subject of that model is.
http://www.miislita.com/searchito/binary-similarity-calculator.html</description>
		<content:encoded><![CDATA[<p>To all above:</p>
<p>You cannot blindly interchange distance and similarity without knowing what the model is and what the subject of that model is.<br />
<a href="http://www.miislita.com/searchito/binary-similarity-calculator.html" rel="nofollow">http://www.miislita.com/searchito/binary-similarity-calculator.html</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anand Kishore</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/comment-page-1/#comment-23381</link>
		<dc:creator>Anand Kishore</dc:creator>
		<pubDate>Sun, 15 Jun 2008 05:05:26 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-23381</guid>
		<description>Jack: Well she was making the point that the two cannot be combined together to give you a number that takes both similarity and magnitude into account.</description>
		<content:encoded><![CDATA[<p>Jack: Well she was making the point that the two cannot be combined together to give you a number that takes both similarity and magnitude into account.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jack</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/comment-page-1/#comment-23277</link>
		<dc:creator>Jack</dc:creator>
		<pubDate>Thu, 12 Jun 2008 13:20:37 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-23277</guid>
		<description>I&#039;m not an expert but I&#039;m not sure what point Dr. E. Garcia is trying to make here...cosine similarity does use magnitudes in the calculation, but the calculation in fact normalizes the magnitudes away, and the output of that calculation is the cosine of an angle, and nothing else.</description>
		<content:encoded><![CDATA[<p>I&#8217;m not an expert but I&#8217;m not sure what point Dr. E. Garcia is trying to make here&#8230;cosine similarity does use magnitudes in the calculation, but the calculation in fact normalizes the magnitudes away, and the output of that calculation is the cosine of an angle, and nothing else.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: nthio</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/comment-page-1/#comment-22035</link>
		<dc:creator>nthio</dc:creator>
		<pubDate>Fri, 16 May 2008 08:44:56 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-22035</guid>
		<description>Hi Anand and Dr. E.Garcia,
May I quote the original post:
&quot;For example, the cosine between a document which has ‘machine’ occurring 3 times and ‘learning’ 4 times and another document which has ‘machine’ occurring 300 times and ‘learning’ 400 times will hint that the two documents are pointing in almost the same direction. If magnitude (euclidean distance) was taken into account, the results would be quite different.&quot;

   I think I get what you (Anand) intended to say in the earlier post, which is that cosine similarity normalizes the magnitude of the vectors in its calculation. For example, take the coordinates A(3,4), B(5,12) and B&#039;(500,1200): the cosine similarity between A and B is the same as between A and B&#039;; however, the Euclidean distance is of course different.
   In the context of query and document searching, a larger document tends to scale up its vector (in the ideal case, which may not always hold in practice, every dimension scales up equally, so although the magnitude, i.e. the size of the document, increases, the vector direction remains the same). In this case, cosine similarity may have a desirable normalizing effect compared to the Euclidean distance, which will tend to favor documents with a scale (as well as a direction) similar to that of the query. In such an application, we may expect the magnitude of the query to be much smaller than that of the documents searched, so using Euclidean distance will tend to favor smaller documents.
   Perhaps if the application is different, such as comparing the strict similarity between two documents, Euclidean distance may have an advantage if we intend to account for the similar magnitude of both documents.</description>
		<content:encoded><![CDATA[<p>Hi Anand and Dr. E.Garcia,<br />
May I quote the original post:<br />
&#8220;For example, the cosine between a document which has ‘machine’ occurring 3 times and ‘learning’ 4 times and another document which has ‘machine’ occurring 300 times and ‘learning’ 400 times will hint that the two documents are pointing in almost the same direction. If magnitude (euclidean distance) was taken into account, the results would be quite different.&#8221;</p>
<p>   I think I get what you (Anand) intended to say in the earlier post, which is that cosine similarity normalizes the magnitude of the vectors in its calculation. For example, take the coordinates A(3,4), B(5,12) and B&#8217;(500,1200): the cosine similarity between A and B is the same as between A and B&#8217;; however, the Euclidean distance is of course different.<br />
   In the context of query and document searching, a larger document tends to scale up its vector (in the ideal case, which may not always hold in practice, every dimension scales up equally, so although the magnitude, i.e. the size of the document, increases, the vector direction remains the same). In this case, cosine similarity may have a desirable normalizing effect compared to the Euclidean distance, which will tend to favor documents with a scale (as well as a direction) similar to that of the query. In such an application, we may expect the magnitude of the query to be much smaller than that of the documents searched, so using Euclidean distance will tend to favor smaller documents.<br />
   Perhaps if the application is different, such as comparing the strict similarity between two documents, Euclidean distance may have an advantage if we intend to account for the similar magnitude of both documents.</p>
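<p>The A, B and B&#8217; example can be checked numerically. What follows is only a minimal sketch in plain Python; the helper names cosine_similarity and euclidean_distance, and the variable B_scaled standing in for B&#8217;, are illustrative assumptions and not part of the original post.</p>

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


A = (3, 4)
B = (5, 12)
B_scaled = (500, 1200)  # hypothetical name for B' in the comment: B scaled by 100

cos_ab = cosine_similarity(A, B)
cos_ab_scaled = cosine_similarity(A, B_scaled)
dist_ab = euclidean_distance(A, B)
dist_ab_scaled = euclidean_distance(A, B_scaled)

# Scaling B changes its length but not its direction,
# so both cosines equal 63/65 up to floating-point rounding.
print(cos_ab, cos_ab_scaled)

# The Euclidean distance, in contrast, grows with the scale of B.
print(dist_ab, dist_ab_scaled)
```

<p>The cosine depends only on direction, so it ignores the factor of 100, while the Euclidean distance is dominated by it.</p>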
]]></content:encoded>
	</item>
	<item>
		<title>By: Robbie Haertel</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/comment-page-1/#comment-20642</link>
		<dc:creator>Robbie Haertel</dc:creator>
		<pubDate>Fri, 25 Apr 2008 21:27:14 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-20642</guid>
		<description>If vectors X and Y are L2-normalized, it is possible to prove that the cosine distance (1 - cosine similarity) is interchangeable with the Euclidean distance.</description>
		<content:encoded><![CDATA[<p>If vectors X and Y are L2-normalized, it is possible to prove that the cosine distance (1 &#8211; cosine similarity) is interchangeable with the Euclidean distance.</p>
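<p>One way to see this: for unit-length vectors, the squared Euclidean distance expands to 2 minus twice the dot product, which is exactly twice the cosine distance, so both measures rank pairs in the same order. The sketch below checks that identity on one pair of vectors; the helper names and the sample vectors are assumptions chosen for illustration.</p>

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Sample vectors (an assumption for this sketch), normalized to unit length.
x = l2_normalize([3.0, 4.0])
y = l2_normalize([5.0, 12.0])

squared_euclidean = euclidean_distance(x, y) ** 2
twice_cosine_distance = 2.0 * cosine_distance(x, y)

# For unit vectors these two quantities agree up to floating-point rounding.
print(squared_euclidean, twice_cosine_distance)
```

<p>The two distances are not numerically equal, but one is a monotonic function of the other, which is what makes them interchangeable for ranking or nearest-neighbour search on normalized vectors.</p>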
]]></content:encoded>
	</item>
</channel>
</rss>
