For a recent project, we (sudheer_624 and I) had to develop algorithms to extract the true content from any given web page. By true content I mean the text excluding ads, navigational links and text, and so on, and even excluding comments (if any). Thus, given a blog post, we are interested in extracting just the content of the post, not the comments or other surrounding text. We did not come across any dataset for this task that would let us evaluate our algorithms, so we recently generated our own and would like to share it with anyone tackling a similar problem.
The dataset contains the HTML source and text content (true content) for around 4000 webpages. One metric for measuring your algorithm against this dataset could be the edit distance between the text your algorithm extracts and the true content. If you do use this dataset, it would be great if you could share the results of your algorithms as benchmarks to compare against. I’ll be updating this post with the accuracy of our algorithm soon enough.
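
As a minimal sketch of how the edit-distance metric could be computed, here is one possible version in Python. It assumes plain Levenshtein distance (the post does not fix a particular variant), and the function names `edit_distance` and `normalized_edit_distance` are made up for illustration. Dividing by the longer length is also just one possible normalization, chosen so that scores are comparable across pages of different sizes.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance: the minimum number of single-item insertions,
    deletions, or substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, item_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of `b`
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(extracted: Sequence, truth: Sequence) -> float:
    """Edit distance divided by the longer length: 0.0 is a perfect match,
    1.0 means nothing lines up. (An assumed normalization, not from the post.)"""
    longest = max(len(extracted), len(truth))
    if longest == 0:
        return 0.0  # two empty texts are identical
    return edit_distance(extracted, truth) / longest
```

Passing two strings compares them character by character. Since the cost grows with the product of the two lengths, long pages may be more practical to compare as word lists, for example `normalized_edit_distance(extracted.split(), truth.split())`.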