A “stemmer” is software that returns a “stem” for a word, usually removing inflections like plural or past tense. The people writing stemmers seem to think they are finished when the mapping is linguistically sensible, but that leaves plenty of room for dumb behavior. Just because it is in the dictionary doesn’t mean it is the right answer.
Here are some legal but not very useful things that our stemmers have done for us:
- “US” to “we” (wrong answer for “US Mail”)
- “best.com” to “good.com” (oops, don’t run URLs through the stemmer)
- “number” to “numb” (correct, but when was the last time you meant “more numb”?)
- “tracking meeting” to “track meet” (gerund to verb that can also be a noun, bleah)
The stemmer people say “use part-of-speech tagging”, but we need to apply exactly the same transformations to the documents and to the queries, or the indexed terms and the query terms won’t match. Queries rarely have enough text for the tagger to work.
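To make the "same transformations" constraint concrete, here is a minimal sketch of a toy inverted index in Python. Everything in it is an assumption for illustration: `analyze`, `build_index`, and `search` are hypothetical names, and the "stemmer" is just a crude trailing-"s" stripper standing in for a real one. The point is only that one function feeds both the index and the query.

```python
from collections import defaultdict


def analyze(text):
    """Hypothetical analysis chain: lowercase, split on whitespace,
    and strip a trailing 's' from tokens longer than three characters."""
    tokens = []
    for raw in text.lower().split():
        if raw.endswith("s") and len(raw) > 3:
            tokens.append(raw[:-1])
        else:
            tokens.append(raw)
    return tokens


def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every analyzed query term."""
    # The query goes through the same analyze() as the documents did.
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "tracking numbers for packages",
    2: "track meet results",
}
index = build_index(docs)
```

If the documents were stemmed with help from a tagger and the two-word query was not, `search` would be comparing terms produced by two different functions, and matches would silently disappear.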
A search-tuned stemmer would be really nice to have. I’ve got some ideas: leave gerunds alone, don’t treat comparatives and superlatives as inflections, and prefer noun-to-noun mapping. It would need to be relevance-tested with real queries against a real corpus, of course.
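Here is a rough sketch of what those ideas might look like as a guard wrapped around an existing stemmer. It is not a tested design. `toy_stem` is a hypothetical lookup table that just reproduces the mappings listed above, not the output of any real library. The URL pattern and the "only a plural suffix was removed" check are my own crude stand-ins, the latter for a noun-to-noun preference, since a real version would need dictionary information.

```python
import re
from typing import Callable

# Hypothetical stand-in for a dictionary-based stemmer. The table only
# reproduces the example mappings from this post.
TOY_STEMS = {
    "us": "we",
    "best": "good",
    "number": "numb",
    "numbers": "number",
    "tracking": "track",
    "meeting": "meet",
    "meetings": "meeting",
}


def toy_stem(word: str) -> str:
    return TOY_STEMS.get(word, word)


# Assumed heuristic: dotted tokens like "best.com" or anything with "://".
URL_LIKE = re.compile(r"^\w+(\.\w+)+$|://")


def looks_like_plural_strip(token: str, stem: str) -> bool:
    """Crude proxy for noun-to-noun: only a plural suffix was removed."""
    if token.endswith("ies") and stem == token[:-3] + "y":
        return True
    return token.startswith(stem) and token[len(stem):] in ("s", "es")


def search_stem(token: str, base_stem: Callable[[str], str] = toy_stem) -> str:
    lowered = token.lower()
    # Don't run URLs through the stemmer at all.
    if URL_LIKE.search(lowered):
        return lowered
    stem = base_stem(lowered)
    if stem == lowered:
        return lowered
    # Leave gerunds alone.
    if lowered.endswith("ing"):
        return lowered
    # Don't treat comparatives and superlatives as inflections.
    if lowered.endswith(("er", "est")):
        return lowered
    # Prefer noun-to-noun: accept only plural-looking changes. In this
    # sketch the check is strict enough to cover the two rules above too;
    # they are kept separate so each can be relaxed and tested on its own.
    if not looks_like_plural_strip(lowered, stem):
        return lowered
    return stem
```

With the toy table, `search_stem("number")` stays "number" and `search_stem("meetings")` becomes "meeting", while "US" simply lowercases to "us" instead of turning into "we". Whether any of these rules actually helps is exactly what the relevance testing would have to show.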