Stupid Stemmer Tricks

A “stemmer” is software that returns a “stem” for a word, usually removing inflections like plural or past tense. The people writing stemmers seem to think they are finished when the mapping is linguistically sensible, but that leaves plenty of room for dumb behavior. Just because it is in the dictionary doesn’t mean it is the right answer.

Here are some legal, but not very useful, things our stemmers have done for us:

  • “US” to “we” (wrong answer for “US Mail”)
  • “best.com” to “good.com” (oops, don’t run URLs through the stemmer)
  • “number” to “numb” (correct, but when is the last time you meant “more numb”?)
  • “tracking meeting” to “track meet” (gerund to verb that can also be a noun, bleah)

The stemmer people say “use part-of-speech tagging”, but we need to apply exactly the same transformations to the documents and to the queries, and queries rarely have enough text for a tagger to work.

A search-tuned stemmer would be really nice to have. I’ve got some ideas: leave gerunds alone, don’t treat comparatives and superlatives as inflections, and prefer noun-to-noun mapping. It would need to be relevance-tested with real queries against a real corpus, of course.
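
Here is a sketch of what those guards might look like, wrapped around NLTK’s PorterStemmer. The heuristics (and the choice of Porter) are mine for illustration; a dictionary-based stemmer would want an exception list instead of suffix checks, and all of it would need that relevance testing.

    import re
    from nltk.stem import PorterStemmer  # assumes NLTK is installed

    _stemmer = PorterStemmer()
    _looks_like_data = re.compile(r"[.\d@/]")  # URLs, hosts, versions

    def search_stem(token: str) -> str:
        """Stem a token, skipping the cases that hurt search relevance."""
        # Don't run URLs or anything URL-ish through the stemmer.
        if _looks_like_data.search(token):
            return token
        # Acronyms like "US" change meaning when stemmed; leave them be.
        if token.isupper():
            return token
        word = token.lower()
        # Leave gerunds alone: "tracking" stays "tracking".
        if word.endswith("ing"):
            return token
        # Don't treat comparatives/superlatives as inflections; this
        # crude suffix check also keeps "number" from becoming "numb".
        if word.endswith(("er", "est")):
            return token
        return _stemmer.stem(word)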

The Right Tool for the Job

My son watered our plants on the patio with the hose, but missed a couple (and watered a couple of chairs, too). We pointed out the missed plants, so he got his Super Soaker, loaded it up, and used it to water them. It was just the right amount of water for two plants.

When choosing between tools for a job, why not choose the fun one?

HTTP Compression is not an Obvious Win

Tim Bray posts about How to Send Data and asks, “if you’re sending anything across the Net, why would you ever send it uncompressed?” Mostly because it is a lot messier than it should be and the payoff is small. I’ll survey the problems we ran into when we added HTTP compression to Ultraseek.

Tim also brings up encryption. That has many of the same problems, but the payoff is much, much bigger, so it is usually worth the hassle.

If you can store your content compressed, some of these problems go away, but not all. Compressing on the fly is often not worth the bother.

Algorithm Compatibility: The spec lists three standard compression algorithms: compress, deflate, and gzip. Compress isn’t as effective, and browsers implement deflate in two incompatible ways (raw deflate versus the zlib-wrapped stream), so the first step is to send only gzip. With gzip, you still have to decide on a compression level.
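
A minimal sketch of that negotiation (the function shape is mine; gzip.compress is the real library call). Level 6 is zlib’s traditional default and a reasonable CPU/size trade-off; level 9 buys little more for noticeably more CPU.

    import gzip

    GZIP_LEVEL = 6  # default-ish trade-off between CPU and size

    def maybe_gzip(body: bytes, accept_encoding: str):
        """Use gzip iff the client advertises it; ignore compress/deflate."""
        offered = {e.split(";")[0].strip() for e in accept_encoding.split(",")}
        if not offered & {"gzip", "x-gzip"}:
            return body, {}
        return gzip.compress(body, compresslevel=GZIP_LEVEL), {
            "Content-Encoding": "gzip"}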

Keep-alive: For HTTP keep-alive, you need to specify the content length in the response header. But with compression, you don’t know that length until the compression is finished, so you can’t send the header to the client until then. This can add substantial delay. You can avoid it with chunked transfer coding, at the cost of additional complexity.
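
Hand-rolled, the chunked version looks roughly like this (a real server framework does the framing for you; wbits=31 is zlib’s flag for a gzip-format stream):

    import zlib

    def write_chunk(sock, data: bytes) -> None:
        """One HTTP/1.1 chunk: hex length, CRLF, payload, CRLF."""
        if data:
            sock.sendall(b"%x\r\n%s\r\n" % (len(data), data))

    def send_gzipped_chunked(sock, pieces) -> None:
        """Stream a gzipped body without knowing the final length."""
        comp = zlib.compressobj(6, zlib.DEFLATED, 31)  # gzip container
        sock.sendall(b"HTTP/1.1 200 OK\r\n"
                     b"Content-Encoding: gzip\r\n"
                     b"Transfer-Encoding: chunked\r\n\r\n")
        for piece in pieces:
            write_chunk(sock, comp.compress(piece))
        write_chunk(sock, comp.flush())
        sock.sendall(b"0\r\n\r\n")  # zero-length chunk ends the body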

Server-side Latency: A great trick for responsive servers is to push content out the socket as soon as you have it. This is especially important if the content takes a while to generate. In our case, you can list all the URLs the spider knows about for a site. This can take a while. So, flush out the template HTML, then flush every N list items. If your content compresses really well (an HTML list of URLs may see 10X compression), then you have a choice of pushing out short packets or making the customer wait. Either way, compression has not improved the user-visible performance.
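
zlib’s sync flush is the escape hatch here: it forces out everything compressed so far in a form the browser can decode immediately, at a small cost in ratio per flush point. A sketch, reusing write_chunk from above (items are byte strings):

    import zlib

    def stream_list(sock, items, flush_every=50):
        """Compress an HTML list on the fly, pushing bytes every N items."""
        comp = zlib.compressobj(6, zlib.DEFLATED, 31)
        pending = b""
        for i, item in enumerate(items, 1):
            pending += comp.compress(b"<li>%s</li>\n" % item)
            if i % flush_every == 0:
                # Z_SYNC_FLUSH makes the output so far decodable, so the
                # browser can render now instead of waiting for the end.
                write_chunk(sock, pending + comp.flush(zlib.Z_SYNC_FLUSH))
                pending = b""
        write_chunk(sock, pending + comp.flush())  # Z_FINISH by default
        sock.sendall(b"0\r\n\r\n")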

TCP Latency: If the latency is dominated by network round-trips or new connections, compression won’t help much. New connections go through TCP slow start, so reducing your page from six packets to four won’t eliminate a single round trip. Slow start doubles the packets in flight on each round trip, so you have 1, 2, 4, … in transit: one RTT delivers one packet, two deliver three, three deliver seven, and so on until you hit the maximum in-flight window.
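
The arithmetic is easy to check with a toy model (this ignores the larger initial windows that later TCP stacks use):

    def round_trips(packets: int) -> int:
        """Round trips to deliver N packets under classic slow start:
        the window doubles each RTT, so after k RTTs 2**k - 1 packets
        have been delivered."""
        k, delivered = 0, 0
        while delivered < packets:
            k += 1
            delivered += 1 << (k - 1)
        return k

    # Shrinking a page from six packets to four saves no round trips:
    assert round_trips(6) == round_trips(4) == 3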

Browser Compatibility: The deflate algorithm mess is one source of browser incompatibility, but there are also older browsers that implement compression badly or only recognize “x-gzip” in the response headers. A really robust implementation may need to check the user agent before sending compressed responses.
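
In practice the gate is a couple of string checks. The broken-browser list below is a placeholder, not a researched one; a real list comes out of bug reports and testing.

    # Placeholder pattern for browsers with broken gzip support.
    BAD_GZIP_AGENTS = ("Mozilla/4.0",)  # hypothetical entry

    def client_can_gzip(accept_encoding: str, user_agent: str) -> bool:
        if "gzip" not in accept_encoding:  # also matches "x-gzip"
            return False
        return not any(bad in user_agent for bad in BAD_GZIP_AGENTS)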

Compressed Formats: Compressing an already-compressed format is a complete waste of time, so you need to make sure to not compress JPEGs, zip archives, etc.
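
That check is just a set of content types (extend it to taste):

    ALREADY_COMPRESSED = {
        "image/jpeg", "image/gif", "image/png",
        "application/zip", "application/gzip", "application/pdf",
    }

    def worth_compressing(content_type: str) -> bool:
        """gzip over a JPEG or a zip burns CPU for near-zero savings."""
        return content_type.split(";")[0].strip() not in ALREADY_COMPRESSED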

Hard to Measure: Good performance measures for this need a range of tests over different real networks with varying bandwidth/delay properties. In our tests, we could not demonstrate conclusive improvement. But it didn’t hurt, so we leave it turned on.

Way back in 1997, LAN-based tests of HTTP compression showed small improvements, around 15-25%. That is not a meaningful difference for the user interface, and maybe not for net utilization either. If there is any increase in latency to start rendering the page, that will be a big loss for responsiveness.

Articles on ultraseek.com

I write occasional articles on search and Ultraseek for ultraseek.com, so I’ve collected links to them all from here. I’ve clipped the first paragraph or two of each one so you can see whether you’d like to read the entire article.

Relevance and User Satisfaction

Search relevance is usually thought of as a statistic that measures whether the search results match the query. That is useful in the lab, but not as useful for a search installation.

When search is part of a site, we need to understand how it helps the users of that site. Can they find things quickly? Are they comfortable with the search?

My Favorite Customer Problem

We are concerned about all of the problems reported by our customers, but there is one problem I don’t mind hearing about.

Keeping Your Index Fresh with Add URL

Everyone wants their search index to be an accurate, timely reflection of their content. Ultraseek automatically revisits pages to find new URLs, and that is very effective, but some sites have even stronger requirements for how quickly documents need to be available in search results. This is called “index freshness.” A stale index misses new pages and serves old information: pages that have already been deleted, and old copies of pages that have changed since they were indexed. For maximum index freshness, use Ultraseek’s Add URL feature to send notifications of deleted, changed, or new URLs.
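
If your publishing system knows when a page is added, changed, or deleted, the notification can be a one-liner. The endpoint and parameter below are hypothetical stand-ins; every Ultraseek installation has its own Add URL address.

    from urllib.parse import urlencode
    from urllib.request import urlopen

    ADD_URL = "http://search.example.com/addurl"  # hypothetical endpoint

    def notify_search(page_url: str) -> None:
        """Tell the index now instead of waiting for the next revisit."""
        urlopen(ADD_URL + "?" + urlencode({"url": page_url}))

    notify_search("http://www.example.com/news/2006/product-launch.html")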

Don’t Reindex Every Week!

If you have used other search engines, you probably had to manually configure your indexing schedule to make sure new content was found and indexed. This is not necessary with Ultraseek.

Ultraseek has “continuous spidering with real-time indexing and adaptive revisit intervals.” It sounds complicated, but it means that Ultraseek will automatically spider most pages at the right times.

Why Not Use “All Terms” Queries?

Google, Yahoo!, and MSN all default to matching all of your search terms, but Ultraseek does not. Why? What do you say when your users want Ultraseek to “work like Google”?

In most cases, it is good for an enterprise engine to behave like the WWW engines because users can intuitively transfer their searching skills. But, this is a case where doing the right thing is more expensive for the WWW engines, and more reasonable for enterprises.

It looks like I’ve been careful to write useful leads, because selecting the first few sentences makes a pretty fair intro to each article.