Archive for the Web Category

Web Content Extraction Dataset

| August 22nd, 2009

For a recent project, we (sudheer_624 and I) have had to develop algorithms to extract the true content from any given web page. By true content I mean the text excluding the ads, navigational links/text, etc., and even excluding comments (if any). Thus, given a blog post, we are interested in extracting just the content of the post and not the comments and other surrounding text. We did not come across any dataset for this task that would let us evaluate our algorithms. We recently generated our own dataset for this purpose and would like to share it with anyone tackling a similar problem.

The dataset contains the HTML source and text content (true content) for around 4000 webpages. One metric to measure your algorithm against this dataset could be the edit distance between the text your algorithm extracts and the true content. If you do use this dataset, it would be great if you could share the results of your algorithms so that others have benchmarks to compare against. I’ll be updating this post with the accuracy of our algorithm soon enough.
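As a rough illustration of the edit-distance metric, here is a minimal Python sketch. The function names (`edit_distance`, `normalized_similarity`) are hypothetical and not part of the dataset, and the normalization to a 0..1 score is an assumption about how one might report results, not the scoring we used.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    # Keep the shorter sequence in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_similarity(extracted: Sequence, reference: Sequence) -> float:
    """Assumed score: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(extracted), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(extracted, reference) / longest


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    # Word-level comparison is much cheaper than character-level on full pages.
    print(normalized_similarity("the true content".split(),
                                "the true page content".split()))
```

The running time is proportional to the product of the two lengths, so for whole pages it is probably more practical to compare lists of words than individual characters.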

Download the dataset here (gzipped)

Dygest Your Search

| March 19th, 2009

Update: This hack won the coveted ‘Search’ category award.

For the last couple of days, @sudheer_624 and I have been busy working on this hack for a Yahoo! Hackday. Although still a prototype, the hack has turned out to be interesting, so we thought of putting it out for others to play around with.

Dygest (pronounced as ‘digest’ – thanks to @bluesmoon) is aimed at changing the conventional way of displaying search context via a snippet to a more informative, machine-generated document summary. There are two kinds of relevance for evaluating search results:

  • Vertical relevance: determined by the ranking algorithms.
  • Horizontal relevance: the contextual information made available to the user about the result – SearchMonkey is a good initiative on this front.

The current way of displaying this context is via a snippet of text under every result. This snippet shows the neighborhood of the occurrence of the query terms. Usually this information is not rich enough for a searcher to make the right judgement about the result. As a result, the searcher has to switch back and forth between the documents and the search results whenever a page turns out not to be relevant. This can be frustrating at times.

Dygest aims to solve this by either replacing or enhancing the current search snippet with a summary of the result page. At its core lies a summarization engine which figures out what the *real* content of the page is (distinguishing it from other junk like surrounding text, navigational text, comments, etc.) and then performs text summarization on this content. The summary of the page is then displayed to the user via the appropriate interface. How cool is that?
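To give a feel for the summarization step, here is a minimal Python sketch of a generic frequency-based extractive summarizer. This is not Dygest's actual engine: the `summarize` function, the tiny `STOPWORDS` set and the scoring rule are all assumptions chosen for illustration, and the sketch expects text that has already been stripped of navigation, ads and comments.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real one would be longer).
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on", "for"}


def _content_words(sentence: str) -> list:
    """Lowercased words of a sentence with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def summarize(text: str, max_sentences: int = 2) -> str:
    """Pick the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in _content_words(s))

    def score(sentence: str) -> float:
        words = _content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Emit the chosen sentences in their original order.
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    page_text = "Dogs bark. Cats chase mice and cats sleep. Cats purr."
    print(summarize(page_text, max_sentences=2))
```

Keeping the selected sentences in document order is a deliberate choice here, since a summary shown under a search result reads better when it follows the flow of the page.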

The user no longer needs to click on irrelevant links and can perceive the theme and important facts of the page right from within the results page. Another advantage is that it gives the user a good overview of the query topic: instead of spending time reading many long documents, the user can read a few summaries from the top results. This is particularly well suited to mobile devices, where it’s frustrating to switch back and forth between pages and the search results. It is also a good fit for news articles, where we just need the important facts about the story.

Well, here is an example to convince you. A search for ‘Carol Bartz’ yields the following result, which at first glance is not at all informative.

Enhancing the existing view with an abstract of the page helps gauge the content and theme of the document. This would now look like:

Dygest outputs the following summaries for the query ‘Iran’ restricted to Yahoo! News:

And the following for ‘Obama stimulus plan’:

Currently, Dygest has two interfaces – (1) a search interface powered by Yahoo! BOSS and (2) a SearchMonkey plugin. It’s just a prototype, so be kind and don’t be too judgmental.

Start dygesting here.



I started working on my second weekend project; I guess I’ll do something small every week. This one is an extension to LifeLogger. The aim is to analyze one’s daily and weekly browsing history and extract themes which could aid in recommendations. It is still a ‘work in progress’ – so far I have been able to generate the following visualizations:

The following visualization depicts the dominant keywords/topics for one day (the terms are stemmed):

I had been reading a couple of Yahoo!-related articles and visualization blogs. This is captured by the above visualization – but there is still a lot of noise which I need to get rid of.

The next visualization depicts the linkages and clusters for the keywords. There exists a link between two terms if they occur in the same document. [may take some time to load - you'll need to zoom in to get a better look - click on 'compute layout' if the clusters don't show]
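The linking rule (two terms are linked if they occur in the same document) can be sketched in a few lines of Python. The function names below are hypothetical, and using connected components as "clusters" is a crude stand-in assumed for illustration; the clusters in the visualization itself come from the graph layout, not from this code.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_edges(documents):
    """Map each linked term pair to the number of documents containing both.

    `documents` is an iterable of term lists, one list per visited page.
    """
    edges = defaultdict(int)
    for terms in documents:
        # Sorting makes every pair appear in one canonical order (a < b).
        for a, b in combinations(sorted(set(terms)), 2):
            edges[(a, b)] += 1
    return dict(edges)


def connected_components(edges):
    """Group terms that are reachable from one another through links."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


if __name__ == "__main__":
    history = [["yahoo", "search", "hack"],
               ["yahoo", "search"],
               ["visual", "graph"]]
    links = cooccurrence_edges(history)
    print(links)
    print(connected_components(links))
```

The per-pair document count doubles as an edge weight, which could be used to drop weak links before drawing the graph.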

Both of the above visualizations depict important metrics that could be used to extract dominant themes from the browsing history. Dominance should not be inferred from frequency alone but also from the prevalence of a term across multiple pages. I still need to work on removing noise and running this on larger datasets, like the browsing history for a week or so. If you have any ideas or good papers to recommend, that would be nice.
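One simple way to combine the two signals is sketched below in Python. The `dominance_scores` function and its formula (total frequency multiplied by the fraction of pages containing the term) are assumptions made for illustration, not a metric I have settled on.

```python
from collections import Counter


def dominance_scores(pages):
    """Score each term by total frequency times the share of pages it is on.

    `pages` is a list of term lists, one list per visited page.
    """
    total = Counter()      # how often each term occurs overall
    page_count = Counter() # on how many pages each term occurs
    for terms in pages:
        terms = list(terms)
        total.update(terms)
        page_count.update(set(terms))
    n_pages = len(pages)
    return {term: total[term] * page_count[term] / n_pages for term in total}


if __name__ == "__main__":
    pages = [["spam"] * 6,          # frequent, but only on one page
             ["yahoo", "search"],
             ["yahoo", "viz"],
             ["yahoo"]]
    scores = dominance_scores(pages)
    for term, value in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(term, value)
```

In this toy history a term repeated six times on a single page ends up below a term that appears once on each of three pages, which is the behaviour the paragraph above argues for.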

Happy Birthday DARPA!

| February 8th, 2008

The Defense Advanced Research Projects Agency (DARPA) celebrated its 50th anniversary today. Who knew 50 years ago that contributions by this defense research agency would evolve into the present Internet? Had it not been for their research, you probably wouldn’t be reading this blog post today. Most of us know of DARPA because of the ARPANET or the DARPA Urban Challenge (AI geeks mostly). It’s time we knew why and under what circumstances DARPA was created.

Read the original DOD DARPA directive

President Eisenhower established the Advanced Research Projects Agency (ARPA) 50 years ago in response to the Soviet Union’s Sputnik launch, which surprised and embarrassed the United States as the Soviets became the first nation to successfully launch a satellite into space. DARPA’s 1958 charter charged the Agency to perform certain advanced research and development projects, with the primary mission of ensuring that the United States would never again be surprised by another nation’s technological advancement. The list of DARPA’s contributions includes the Saturn V rocket and the ARPANET (which laid the foundations for the Internet).

On this very day 50 years ago, February 7th 1958, the following concise document started it all: the original DOD directive.