Stupid Stemmer Tricks

A “stemmer” is software that returns a “stem” for a word, usually removing inflections like plural or past tense. The people writing stemmers seem to think they are finished when the mapping is linguistically sensible, but that leaves plenty of room for dumb behavior. Just because it is in the dictionary doesn’t mean it is the right answer.

Here are some legal but not very useful things that our stemmers have done for us:

  • “US” to “we” (wrong answer for “US Mail”)
  • “best.com” to “good.com” (oops, don’t run URLs through the stemmer)
  • “number” to “numb” (correct, but when is the last time you meant “more numb”?)
  • “tracking meeting” to “track meet” (gerund to verb that can also be a noun, bleah)

The stemmer people say “use part of speech tagging”, but we need to do exactly the same transformations to the documents and to the queries. Queries rarely have enough text for the tagger to work.
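
Here is a minimal sketch of the "same transformations on both sides" idea, with a guard that keeps hostnames away from the stemmer. Everything in it is made up for illustration: `toy_stem` is a hypothetical lookup table that only reproduces the mappings listed above, and the hostname pattern is a rough assumption, not a real URL parser.

```python
import re

# Hypothetical stand-in for a dictionary-based stemmer. The table only
# reproduces the mappings from the list above; it is not a real stemmer.
TOY_STEMS = {
    "best": "good",
    "better": "good",
    "number": "numb",
    "tracking": "track",
    "meeting": "meet",
}


def toy_stem(token):
    """Return the toy stem for a token, or the token itself if unknown."""
    return TOY_STEMS.get(token, token)


# Assumption: a token made of dot-separated labels is a hostname.
HOSTNAME = re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)+$")


def analyze(text):
    """Lowercase, split on whitespace, and stem everything except hostnames."""
    terms = []
    for raw in text.split():
        token = raw.lower().strip(",;:!?\"'()").rstrip(".")
        if not token:
            continue
        if HOSTNAME.match(token):
            terms.append(token)  # keep hostnames whole and unstemmed
        else:
            terms.append(toy_stem(token))
    return terms


# The same function serves the indexer and the query parser, so documents
# and queries always get identical treatment.
def index_terms(document):
    return analyze(document)


def query_terms(query):
    return analyze(query)
```

Because both sides call `analyze`, a query for "best.com" matches a document that mentions "best.com", and neither one turns into "good.com". The plain word "best" still becomes "good" here, which is the remaining problem.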

A search-tuned stemmer would be really nice to have. I’ve got some ideas: leave gerunds alone, don’t treat comparatives and superlatives as inflections, and prefer noun-to-noun mapping. It would need to be relevance-tested with real queries against a real corpus, of course.
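
A rough sketch of how those ideas might wrap an existing stemmer. This is guesswork, not a tested design: the suffix checks are crude heuristics rather than real part-of-speech knowledge, `toy_base_stem` is a hypothetical stand-in for whatever stemmer you already have, and none of it has been relevance-tested.

```python
# Hypothetical stand-in for an existing, overly eager stemmer.
TOY_STEMS = {
    "best": "good",
    "better": "good",
    "number": "numb",
    "tracking": "track",
    "meeting": "meet",
}


def toy_base_stem(token):
    """Table lookup first, then strip a simple plural "s"."""
    if token in TOY_STEMS:
        return TOY_STEMS[token]
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


IRREGULAR_COMPARATIVES = {"better", "best", "worse", "worst"}


def is_protected(token):
    """Guess whether a token is a gerund, comparative, or superlative.

    Assumption: suffix matching is good enough for a sketch. It will
    misfire on some words, which is why real queries are needed to tune it.
    """
    if token in IRREGULAR_COMPARATIVES:
        return True
    if len(token) > 4 and token.endswith(("ing", "er", "est")):
        return True
    return False


def search_stem(token, base_stem=toy_base_stem):
    """Leave protected tokens alone; hand everything else to the base stemmer."""
    if is_protected(token):
        return token
    return base_stem(token)
```

With this wrapper, "tracking", "best", and "number" pass through untouched, while plurals like "stemmers" still reduce to "stemmer", so the mapping stays noun-to-noun.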

2 thoughts on “Stupid Stemmer Tricks”

  1. I’ve noticed that all stemmers seem to have a few words that are hard. But that second example… is that your stemmer, or is that a synonym tool at work?

  2. In the second example, the engine is also breaking words at periods, so hostnames are separated into stemmable parts.
    Other than that, it is plain ol’ stemming. I’ve used two different stemmers that consider “best” and “better” to be inflections of “good”. That strikes me as wrong. A Consumer Reports “Best Buy” is not a “good buy”. I’m sure they have some solid linguistic justification for that, but it isn’t useful behavior, regardless of the theory.