Stupid Stemmer Tricks

A “stemmer” is software that returns a “stem” for a word, usually by removing inflections such as plural or past-tense endings. The people writing stemmers seem to think they are finished when the mapping is linguistically sensible, but that leaves plenty of room for dumb behavior. Just because a stem is in the dictionary doesn’t mean it is the right answer.

Here are some legal but not very useful things that our stemmers have done for us:

“US” to “we” (wrong answer for “US Mail”)
“best.com” to “good.com” (oops, don’t run URLs through the stemmer)
“number” to “numb” (correct, but when was the last time you meant “more numb”?)
“tracking meeting” to “track meet” (gerund to verb that can also be a noun, bleah)

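The stemmer people say “use part of speech tagging”, but we need to apply exactly the same transformations to the documents and to the queries. Queries rarely have enough text for the tagger to work, so a tagger-driven stemmer could stem the same word one way in a document and another way in a query.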
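
To make that constraint concrete, here is a minimal Python sketch in which one analyze function feeds both the index and the query, so whatever the stemmer does, it does to both sides. The names are all made up for illustration: toy_stem is a hypothetical lookup table standing in for a real stemmer, not any library's API, and it only reproduces the “tracking meeting” mapping from the list above.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a real stemmer: a tiny lookup table that
# reproduces one of the mappings listed above. Not a real library API.
TOY_STEMS = {"tracking": "track", "meeting": "meet"}

def toy_stem(token):
    return TOY_STEMS.get(token, token)

def analyze(text):
    """Lowercase, split on non-letters, stem. Used for documents AND queries."""
    return [toy_stem(t) for t in re.findall(r"[a-z]+", text.lower())]

def build_index(docs):
    """Map each stemmed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every analyzed query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()

docs = {1: "Notes from the tracking meeting", 2: "Results of the track meet"}
index = build_index(docs)
print(search(index, "tracking meeting"))
```

Both documents come back for “tracking meeting”. The analysis is consistent on both sides, which is what matching requires, but the sports results are probably not what the searcher wanted.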

A search-tuned stemmer would be really nice to have. I’ve got some ideas: leave gerunds alone, don’t treat comparatives and superlatives as inflections, and prefer noun-to-noun mapping. It would need to be relevance-tested with real queries against a real corpus, of course.
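
As a rough sketch of those ideas, a search-tuned stemmer could start life as a thin wrapper that decides which tokens the underlying stemmer is allowed to touch. Everything below is an assumption for illustration: base_stem is a hypothetical toy table standing in for a real stemmer, the “ing” suffix check is a crude gerund detector, and the comparative list is hand-made. The noun-to-noun preference needs part-of-speech information from a lexicon, so it is left out here.

```python
# Hypothetical toy stemmer reproducing the mappings complained about above.
BASE_STEMS = {
    "best": "good", "better": "good", "number": "numb",
    "tracking": "track", "meeting": "meet", "meetings": "meeting",
    "buys": "buy",
}

def base_stem(token):
    return BASE_STEMS.get(token, token)

# Hand-made list (an assumption): comparatives and superlatives to leave alone.
# "number" is here because a query almost never means "more numb".
COMPARATIVES = {"best", "better", "worst", "worse", "number"}

def search_stem(token):
    """Stem conservatively: skip gerunds and comparatives/superlatives."""
    token = token.lower()
    if token in COMPARATIVES:
        return token
    # Crude gerund test (an assumption): 5+ letters ending in "ing".
    # It also skips words like "spring", which is harmless here.
    if len(token) >= 5 and token.endswith("ing"):
        return token
    return base_stem(token)

print([search_stem(t) for t in "best tracking meetings buys".split()])
```

With this toy table, “best” and “tracking” pass through untouched while the plurals “meetings” and “buys” are still reduced. Whether rules like these actually help is exactly the relevance-testing question.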

2 thoughts on “Stupid Stemmer Tricks”

I’ve noticed that all stemmers seem to have a few words that are hard. But that second example… is that your stemmer or is that a synonym tool at work?

In the second example, the engine is also breaking words at periods, so hostnames are separated into stemmable parts.
Other than that, it is plain ol’ stemming. I’ve used two different stemmers that consider “best” and “better” to be inflections of “good”. That strikes me as wrong. A Consumer Reports “Best Buy” is not a “good buy”. I’m sure they have some solid linguistic justification for that, but it isn’t useful behavior, regardless of the theory.

Most Casual Observer

In physics class, many things are intuitively obvious to the most casual observer. Welcome to my casual observations.
