stop words

| August 24th, 2010

In a recent implementation for a near-duplicate detection task, I relied on stop words as key features for extracting signatures from text. The results turned out to be good, but that’s not what I’m focusing on here. This was quite contrary to the mindset we have been accustomed to in the IR/NLP domain, where these words are considered meaningless and must be removed before building any model or index. These words, on the other hand, encode a plethora of information: tense, plurality, (un)certainty, subjectivity and more. They bind the semantics of a sentence together and give it context. Yet (at least in the IR sense) we give them a negative connotation (STOP/NN -0.140192 sentiment). I would go a step further and say that we should stop calling them *stop* words and instead accept the inability of some IR systems to make correct use of them. How about *glue* words for a change? Or maybe not.
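To make the idea concrete, here is a minimal sketch (not my actual implementation; the word list, n-gram size, and function names are all illustrative) of how stop words can serve as signature features: keep only each text’s stop words, in order, hash overlapping n-grams of that sequence into a set, and compare sets. Texts sharing the same stop-word “skeleton” come out as likely near duplicates even when the content words differ.

```python
import hashlib

# A tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to",
              "is", "are", "was", "were", "it", "that", "this", "for"}

def stopword_signature(text, n=3):
    """Hash n-grams of the text's stop-word sequence into a signature set."""
    stops = [w for w in text.lower().split() if w in STOP_WORDS]
    grams = [" ".join(stops[i:i + n]) for i in range(len(stops) - n + 1)]
    return {hashlib.md5(g.encode()).hexdigest()[:8] for g in grams}

def jaccard(a, b):
    """Jaccard similarity between two signature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "the cat sat on the mat and it was happy in the sun"
doc2 = "the dog sat on the rug and it was happy in the shade"
doc3 = "completely different text with zero overlap whatsoever here"

sig1, sig2, sig3 = (stopword_signature(d) for d in (doc1, doc2, doc3))
print(jaccard(sig1, sig2))  # high: identical stop-word skeleton
print(jaccard(sig1, sig3))  # low: no shared stop-word n-grams
```

Note that doc1 and doc2 share no content words beyond the stop words, yet their signatures match perfectly, which is exactly the property the signature scheme exploits.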

PS: In case you are looking for a list of stop words for different languages, here is a good list –

  • mat kelcey

    I’ve always been in two minds about stop words…

    My experience has been that given enough data (whatever “enough” might mean; it is so dependent on the task at hand), dropping special handling of stop words has never made a big difference to the quality of results.

    It’s always nice to not need special handling code!


  • Nicolas Torzec

    To my knowledge the concept of “stop words” makes much more sense for IR than for NLP.

    Actually, most of the definitions of stop words – and these definitions vary from one system to another – come from the IR world, not the NLP world. And as far as I know, the term has been credited to Hans Peter Luhn, one of the pioneers of information retrieval, who used the concept in his designs.

    IR commonly defines stop words as words or multi-word units that do not appear in indices, either because they are insignificant (e.g., articles, prepositions) or because they are so common that the result sets would be larger than the system can handle.

    But because of homographs, it is not that simple, even for IR. Depending on the context, some sequences of tokens might be considered sequences of stop words, or plain meaningful words. For example, “Who” and “The” might each be considered a stop word according to this definition, but “The Who” is definitely not a stop word, especially if you are building a music index…

    For NLP, every token is meaningful when you are trying to analyze the syntax and semantics of sentences for applications such as sentiment analysis, (natural language) question answering, text-to-speech synthesis, etc.

    So I agree with you: the concept of stop words is more related to the inability of (some) IR systems to make correct use of them than to anything else :)

    PS: Great tweets and blog BTW.