Profanity is prevalent in user-generated content such as comments. Websites that do not want to display profane content currently rely on masking, which replaces each profane word with characters like ####. The masked content still conveys the existence of profanity to the reader, though, because humans have built up a great language model for inferring missing words. Try it yourself: it should be easy to guess a handful of profane words that fit the following sentence:
What the ####!
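For reference, the masking approach described above can be sketched in a few lines of Python. This is only an illustration of plain masking, not of Bleep itself; the word list `PROFANE_WORDS` and the function name `mask_profanity` are made up for the example, and a real site would use a much larger list.

```python
import re

# Hypothetical, deliberately mild word list used only for illustration.
PROFANE_WORDS = {"heck", "darn"}

def mask_profanity(text, words=PROFANE_WORDS):
    """Replace each listed word with '#' characters of the same length."""
    # Match whole words only, ignoring case.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in sorted(words)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: "#" * len(m.group(0)), text)

print(mask_profanity("What the heck!"))
```

The mask keeps the word's length and position, which is exactly what lets a reader guess what was removed.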
My hack (Bleep) for the Yahoo! Spring ’11 Hackday is yet another natural language hack, one that tries to remove the profanity from a comment without altering the semantics of the content. In brief, removing the profane word from the content makes the sentence's parse tree less probable. The algorithm then tries to alter this improbable parse tree to find the best local parse tree.
Following are some corrections suggested by Bleep:
Apart from the content, various metadata features (such as the IP address) can help tell a spammer and a regular user apart. The following are results of some data analysis, done on over 8000 comments, that point to another feature that proves to be a good discriminator: typing speed, that is, the length of a comment relative to the time taken to write it. Hopefully this will aid others fighting spam and abuse (if they are not already using a similar feature).
Wikipedia lists the average typing speed of a regular internet user at around 30 words per minute (wpm). With a conversion factor of 5 characters per word, this is 150 characters per minute (cpm), or ~2.5 characters per second. The green line in the plot depicts a projection of the content length a user could have typed in the given time at this average typing speed (~2.5 chars per sec). We observe that this line clearly separates most spam from ham. The ham posts that fall above this line are usually trolls (as observed).
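The check behind that line is simple enough to sketch in Python. This is a minimal sketch, not the code used for the analysis: the function names and the assumption that the time spent composing the comment is available in seconds (here `seconds_elapsed`) are mine.

```python
AVG_WPM = 30          # average typing speed quoted above
CHARS_PER_WORD = 5    # conversion factor quoted above
CHARS_PER_SEC = AVG_WPM * CHARS_PER_WORD / 60  # 150 cpm -> 2.5 chars per sec

def max_typed_length(seconds_elapsed):
    """Content length an average typist could produce in the given time."""
    return CHARS_PER_SEC * seconds_elapsed

def exceeds_typing_speed(content_length, seconds_elapsed):
    """True if the comment is longer than an average typist could have typed."""
    if seconds_elapsed <= 0:
        # No measurable time spent: any non-empty content is too fast.
        return content_length > 0
    return content_length > max_typed_length(seconds_elapsed)

# A 500-character comment submitted after 10 seconds is above the line;
# a 100-character comment submitted after a minute is below it.
print(exceeds_typing_speed(500, 10), exceeds_typing_speed(100, 60))
```

A post above the line is not proof of spam on its own (fast typists exist, and trolls land there too), so the result is better used as one feature among several than as a hard filter.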
This turns out to be a nice feature for telling spammers (bots and non-bots), trolls, and regular users apart. Bots often fail the Turing test and don’t try hard enough to be more human-like. Non-bot spammers, on the other hand, have to take the pains to type their spam comment repeatedly, and usually end up pasting it. Try out the example here.
So spammers, fix yourselves, ’cause we have the speed gun to pull you over.