Archive for the Natural Language Processing Category


| September 21st, 2011

Many a time I’ve stared at Explored Flickr photos and tried to grok their artistic nuances. Lacking artistic sensibility, I sometimes fail to understand the techniques or properties that the photographer used or intended to capture. But the Flickr community is brimming with experts who often chime in, in the comments, with what they like or see. My #nlproc hack (for the upcoming Yahoo! Winter Hackday) aims to solve this by summarizing this expert knowledge (wisdom of the crowd) for a photograph.

What You Comment Is What You See (WYCIWYS) is a Flickr hack that harnesses the comments on a photo to determine which attributes/properties of the photo people are talking about. It also gives a sentiment score (+ve) for each attribute to help a user gauge what other users find most interesting about a photo. Following are some outputs for WYCIWYS:

[Screenshots of WYCIWYS output]
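The general idea can be sketched in a few lines of Python. This is not the hack's actual code: the attribute list, the positive-word list, and the `attribute_scores` function are all hypothetical stand-ins, and a real system would presumably rely on a POS tagger and a proper sentiment lexicon rather than hand-written sets.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicons, assumed for illustration only.
ATTRIBUTES = {"colors", "lighting", "composition", "bokeh", "reflection"}
POSITIVE = {"love", "great", "beautiful", "stunning", "nice", "amazing"}

def attribute_scores(comments, window=3):
    """Count positive words that appear within `window` tokens of each attribute."""
    scores = defaultdict(int)
    for comment in comments:
        tokens = re.findall(r"[a-z]+", comment.lower())
        for i, tok in enumerate(tokens):
            if tok in ATTRIBUTES:
                nearby = tokens[max(0, i - window): i + window + 1]
                scores[tok] += sum(1 for w in nearby if w in POSITIVE)
    return dict(scores)

comments = [
    "Love the colors in this shot!",
    "Great composition and beautiful colors.",
    "The lighting is stunning.",
]
result = attribute_scores(comments)
print(result)
```

Attributes that several commenters praise end up with higher scores, which is the kind of signal that lets a viewer see what others find most interesting about a photo.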

what the bleep!

| March 4th, 2011

Profanity is often prevalent in user-generated content (like comments). Websites that do not want to display profane content currently rely on masking, which replaces the profanity with characters like ####. The masked content, though, still conveys the existence of profanity to the user, because humans have built up a great language model for inferring missing words. Try it yourself. It should be easy to guess a handful of profane words that fit the following sentence:

What the ####!
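For reference, masking itself takes only a few lines. The sketch below assumes a small placeholder word list (`PROFANITY`) and a hypothetical `mask` helper; a real site would use a curated list.

```python
import re

# Placeholder word list, assumed for illustration only.
PROFANITY = {"heck", "darn"}

def mask(text, words=PROFANITY):
    """Replace each listed word with '#' characters of the same length."""
    alternatives = "|".join(re.escape(w) for w in sorted(words))
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: "#" * len(m.group(0)), text)

masked = mask("What the heck!")
print(masked)
```

The mask hides the characters but leaves the slot in the sentence intact, which is exactly why the reader can still guess what was there.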

My hack (Bleep) for the Yahoo! Spring ’11 Hackday is yet another natural language hack, one that tries to remove the profanity from a comment without altering the semantics of the content. In brief, removing the profane word from the content makes the parse tree less probable. The algorithm then tries to alter this improbable parse tree to find the best local parse tree.
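A rough way to picture this is sketched below. It is not Bleep's algorithm: instead of scoring parse trees, it swaps in a much simpler add-one smoothed bigram model trained on a tiny made-up corpus, and the `candidates` and `bleep` functions are hypothetical names. The shape is the same, though: delete the profane word, generate a few local edits around the gap, and keep the most probable result.

```python
import math
from collections import Counter

# Tiny made-up corpus standing in for a real model's training data.
CORPUS = [
    "<s> what is this </s>",
    "<s> what a mess </s>",
    "<s> this is a mess </s>",
    "<s> what is the problem </s>",
]
sentences = [s.split() for s in CORPUS]
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
VOCAB = len(unigrams)

def log_prob(words):
    """Add-one smoothed bigram log-probability, averaged per transition."""
    seq = ["<s>"] + words + ["</s>"]
    total = sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + VOCAB))
        for a, b in zip(seq, seq[1:])
    )
    return total / (len(seq) - 1)

def candidates(words, i):
    """Local edits around position i: drop the word alone or with one neighbour."""
    yield words[:i] + words[i + 1:]
    if i > 0:
        yield words[:i - 1] + words[i + 1:]
    if i + 1 < len(words):
        yield words[:i] + words[i + 2:]

def bleep(words, profane):
    """Repair the first profane word found; later ones are left untouched."""
    for i, w in enumerate(words):
        if w in profane:
            return max(candidates(words, i), key=log_prob)
    return words

cleaned = bleep("what the heck is this".split(), {"heck"})
print(" ".join(cleaned))
```

Averaging the log-probability per transition is a design choice made here so that shorter candidates are not favoured merely for being shorter.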

Following are some corrections suggested by Bleep:

stop words

| August 24th, 2010

In a recent implementation for a near-duplicate detection task I relied on stop words as key features for extracting signatures from text. The results turned out to be good, but that’s not what I’m focusing on here. Relying on stop words runs quite contrary to the mindset we have grown accustomed to in the IR/NLP domain, where these words are considered meaningless and are discarded before building any model/index. These words, on the other hand, encode a plethora of information such as tense, plurality, (un)certainty, subjectivity and more. They bind the semantics of a sentence together and give it context. Yet (at least in the IR sense) we give them a negative connotation (STOP/NN -0.140192 sentiment). I would go a step further and say that we should stop calling them *stop* words and instead accept the inability of some IR systems to make correct use of them. How about *glue* words for a change? Or maybe not.
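To make the idea of stop words as signature features concrete, here is a minimal sketch. The post does not describe the actual implementation, so everything here is an assumption: the tiny `STOP_WORDS` list, the choice to anchor each signature at a stop word followed by the next two non-stop words, and the use of Jaccard overlap to compare documents.

```python
import re

# A deliberately tiny stop-word list, assumed for illustration only.
STOP_WORDS = {"the", "a", "is", "of", "in", "on", "to", "and", "was"}

def signatures(text, chain=2):
    """Anchor a signature at each stop word: the stop word itself plus
    the next `chain` non-stop words that follow it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    sigs = set()
    for i, tok in enumerate(tokens):
        if tok in STOP_WORDS:
            following = [t for t in tokens[i + 1:] if t not in STOP_WORDS][:chain]
            if len(following) == chain:
                sigs.add((tok, *following))
    return sigs

def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as no overlap."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

original = signatures("The cat sat on the mat in the sun.")
near_dup = signatures("Breaking: the cat sat on the mat in the sun! Read more")
unrelated = signatures("A dog ran to the park and barked.")

print(jaccard(original, near_dup), jaccard(original, unrelated))
```

Because stop words occur throughout running prose but are sparse in navigation text and link lists, anchoring on them tends to sample the body of a document, which is one plausible reason they can work well as features for this task.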

PS: In case you are looking for a list of stop words for different languages, here is a good list –

Reading Less Is Reading More

| October 7th, 2009

If information is what drives you to the internet, like me, you might be spending roughly 60-70% of your time online reading blogs, news and feeds (not to forget twitter). For me at least, reading online has superseded email (and updating social networks) as the most time-consuming activity. And yet everyone is busy generating more content rather than finding a way to consume all this information. We are trying to tackle precisely this problem with Dygest. At its core Dygest is a summarization engine that tries to sift through all the noise and present only the *real* content/news contained in any (news) article/text. Recently, we released an experimental version of a feed summarizer that uses the Dygest engine to summarize blog posts/news for any RSS/ATOM feed. This summarized feed can be subscribed to in any feed reader like Bloglines, Google Reader etc.

NOTE: A feed that our system has never encountered before should be summarized within a couple of minutes.
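To give a feel for what a summarization engine does, here is a bare-bones extractive summarizer. It is not the Dygest engine, whose algorithms are not described here. It is a generic word-frequency baseline, with a hypothetical `summarize` function and a small assumed stop-word list, that keeps the sentences whose words are most frequent in the text.

```python
import re
from collections import Counter

# Small stop-word list, assumed for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "to", "and",
              "it", "this", "that", "for", "on", "with"}

def content_words(text):
    """Lowercased words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=1):
    """Keep the highest-scoring sentences, in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)

text = ("Dygest summarizes feeds. I had coffee this morning. "
        "Summarized feeds save reading time.")
summary = summarize(text, max_sentences=2)
print(summary)
```

The off-topic middle sentence is dropped because none of its words recur elsewhere in the text.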

Feed Summarizer

On the whole, with Dygest, reading blogs becomes much faster and more concise, and consuming information becomes a great deal easier. Imagine the time saved by reading the summarized version instead of the original post (you are also not overwhelmed with useless information). See for yourself below:

Original Post


Summarized Post


While you might have the urge to head over to Dygest and summarize your entire subscription list on Google Reader, I would recommend reading this post a bit further for some really cool stuff we have in store. If you must though – click here to Dygest.

Summarizing Your Twitter Links

Readtwit is a really cool service launched recently, which extracts links from your twitter feed and packages them in a clean RSS format. The awesome combination of Readtwit along with Dygest yields a summarized twitter feed delivered to your favorite feed reader.

Steps to get a summarized twitter feed:

(1) Sign into Readtwit.
(2) Copy the link on the ‘Get me the feed’ button.
(3) Paste this link into the Dygest interface and subscribe to the summarized feed returned in your favorite feed reader.

More To Come

This is just an experimental release of Dygest, so do send in your feedback on the summaries and help us improve. In the coming months we will be working on improving the algorithms and churning out other great applications of Dygest (there is something really cool in the works). So while we are busy teaching computers to read, Dygest your feeds – because reading less is reading more.

Follow us on twitter – @dygest