Progressive Reranking

Occasionally, people want to rescore search results and rerank them after they are returned from the search engine. The usual answer is that it is very slow because you need to fetch the full list of results, score them, then sort by the new score.

You don’t really want to rerank the full list. If your search engine is returning very useful results at the end of the list, you have big problems, bigger than you can fix with reranking. The normal case is that you are moving some results up and down a few positions, and users will only look at the first ten results. So let’s look at how to do that efficiently.

A Pull Model for Search Results

The Ultraseek XPA Java library has classes to combine results from multiple engines and rerank them by global tf.idf scores. These classes can be used for other kinds of reranking as well. You can write your own code, of course, but I’ll use the XPA classes for this example.

CompositeSearchable (and CompositeSearchResultList) reads results into a priority queue ordered by score. When client code requests a result from CompositeSearchable, that result is removed from the queue and added to a “frozen” list of scored results; the queue is then replenished with one new result from the source, which is scored and inserted according to its score, and the requested result is returned.

The frozen list allows downstream searchables to treat the already accessed portion of the search results as read-only and unchanging.
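
Here is a minimal sketch of that pull loop in plain Java. It is not the actual XPA implementation; the generic result type, the scorer function, and the source iterator are stand-ins for whatever your engine returns.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.ToDoubleFunction;

// Minimal sketch of the pull model, not the XPA classes themselves.
class PullReranker<R> {
    private static class Scored<R> {
        final R result;
        final double score;
        Scored(R result, double score) { this.result = result; this.score = score; }
    }

    private final PriorityQueue<Scored<R>> queue =
        new PriorityQueue<>(Comparator.comparingDouble((Scored<R> s) -> s.score).reversed());
    private final List<R> frozen = new ArrayList<>();   // already-returned results, treated as read-only
    private final Iterator<R> source;                    // raw results from the engine, in engine order
    private final ToDoubleFunction<R> scorer;            // the reranking score

    PullReranker(Iterator<R> source, ToDoubleFunction<R> scorer, int readAhead) {
        this.source = source;
        this.scorer = scorer;
        refill(readAhead);   // prime the queue with the read-ahead window
    }

    // Return the next reranked result: freeze it, then replenish the queue with one newly scored result.
    R next() {
        Scored<R> best = queue.poll();
        if (best == null) return null;   // source exhausted
        frozen.add(best.result);
        refill(1);
        return best.result;
    }

    private void refill(int n) {
        for (int i = 0; i < n && source.hasNext(); i++) {
            R r = source.next();
            queue.add(new Scored<>(r, scorer.applyAsDouble(r)));
        }
    }
}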

This is a lot faster than it sounds, because the next result is usually already cached in the local UltraseekServer class, having arrived as part of the chunk of results requested from the server.

If a CompositeSearchable is merging results from different sources, the replacement result always comes from the same source as the one moved to the frozen list. If all the good results are from one collection, this replacement policy guarantees that they will all be shown.

Applying Score Adjustments

Assume we have a score for each document that represents a priori goodness. Perhaps it is a measure of how long it would take monkeys to type that document (MonkeyRank). This score ranges from 0..1 with 1 meaning good.

SearchResult already has a document quality measure which matches this, with a range of -1.0..+1.0. We can use the default result scorer, but replace (or modify) Ultraseek’s document quality with our own MonkeyRank.

We do this with SearchResultWrapper, which delegates all methods to a wrapped SearchResult. We’ll scale the MonkeyRank value to a range of -0.2..+0.2. This is a bit more powerful than the Ultraseek quality score, which goes from -0.16..+0.15.

class MonkeySearchResult extends SearchResultWrapper {
    // Assumes the wrapper is constructed around the result it wraps.
    MonkeySearchResult(SearchResult sr) {
        super(sr);
    }

    public float getQuality() {
        // MonkeyRank is 0..1; scale it to -0.2..+0.2 and add it to the engine's quality.
        float mr = MonkeyRank.getRank(baseResult.getURL());
        float scaledMR = (mr * 0.4f) - 0.2f;
        return scaledMR + baseResult.getQuality();
    }
}

Then we write a MonkeyResultScorer class. It wraps each search result in a MonkeySearchResult, then hands it to the default result scorer.

class MonkeyResultScorer implements ResultScorer {
    public float score(SearchResult sr, SearchResultList srl) {
        // Wrap the result so the default scorer sees the MonkeyRank-adjusted quality.
        return CompositeSearchable.RESULT_SCORER.score(
            new MonkeySearchResult(sr), srl);
    }
}

The only thing left is to figure out the maximum number of rank positions that a score boost can move a result up the list. That determines the read-ahead level. We only care about movement to a more relevant rank (earlier in the list). Pushing results down the list is much easier.

Now you need to set the read-ahead. If the read-ahead is too small, then some good results will not be re-ranked as high as they should be. If the read-ahead is too big, performance will suffer because you will need to fetch more results from the engine before the first one can be shown.

Here are a few ways to guess at the read-ahead size. You’ll probably need to measure quite a few queries to get good numbers. I’d use the top 100 single-word queries and the top 100 multi-word queries. The two kinds are scored somewhat differently in the engine, and multi-word queries are underrepresented in the overall top 100 even though they are more than half of all queries.

  1. Look at the maximum score modification (+/- 0.2) and then at the scoring for result lists. Count how far a document would move if it was scored 0.2 higher and everything above it was scored 0.2 lower. This is the worst case read-ahead value for your data and queries. It will be much too large.
  2. Compare a raw list and a re-ranked list and look at the biggest movements up the list. This will be the typical read-ahead value for your data and queries, probably a much smaller number than the above (a sketch of this measurement follows the list).
  3. Do either of the above, but only count results that move into the top three slots (“above the fold”) or the top ten slots (first page). This number is more practical, especially for scores based on link-graph data, like PageRank. Those tend to make a big difference for a few documents, thanks to the power-law distribution in the web link graph. Also, if your scoring doesn’t move hits above the fold, it is a waste of time and you don’t need to implement any of this!
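
Here is a small sketch of the second and third measurements, assuming you have captured the raw engine order and the reranked order as lists of URLs; those lists are stand-ins for however you log your results.

import java.util.List;

// Sketch of read-ahead estimation: compare the raw engine order with the
// reranked order and report the largest move up the list. Restricting to
// results that land in the top N gives the more practical number from item 3.
class ReadAheadEstimator {
    static int maxUpwardMove(List<String> rawOrder, List<String> rerankedOrder, int topN) {
        int max = 0;
        int limit = Math.min(topN, rerankedOrder.size());
        for (int newRank = 0; newRank < limit; newRank++) {
            int oldRank = rawOrder.indexOf(rerankedOrder.get(newRank));
            if (oldRank > newRank) {                 // this result moved up the list
                max = Math.max(max, oldRank - newRank);
            }
        }
        return max;
    }
}

Run it over the sample queries, then take the maximum (or a high percentile, if you can tolerate a few mis-ranked results) as your read-ahead.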

Dirty Pages: Popularity Ranking

A few years ago I was talking with a friend about using access frequency (popularity) as a factor in ranking web pages. He pointed out that this works well for dead-tree materials, too.

He used to go to the university library and start down a row of bound math journals. He’d pull one out and look at the non-bound edge. If there was a section where the page edges were dirty, that was a paper that lots of people had read. So he would read it. Then on to the next volume. Going purely by popularity led him to top papers in areas that he might not have looked at otherwise.

Interestingly, this only works when the materials are shared, like in a library.

Future of Bibliographic Control

I went to the Library of Congress open meeting on bibliographic issues a couple of weeks ago. Interesting, but I think they have a long way to go. This meeting was a good stab at understanding users, both searchers and catalogers, but the tricky part is the model and system interface. How to support links and mashups and massive content generation and cataloging? There was some talk about tagging, but the anti-spam algorithms needed for low-trust, low-authority cataloging are far beyond the expertise and budgets of libraries.

The official writeup and lots of notes by Karen Coyle are good places for more thorough coverage.

Bernie Hurley from UC Berkeley gave a talk on issues today with MARC (see Karen’s notes). This was far more interesting than I expected, mostly because it was fact-based. Some tidbits:

  • MARC cataloging is expensive, even when outsourced to India
  • thesis cataloging is different; the subject areas tend to fall outside the established categories
  • MARC has more information than they use (175 fields, but two-thirds of searches use just 3 of them, and at most 27 are displayed)
  • it does not have the information that is needed for search and faceted browse (from Andrew Pace, NC State)
  • the book height and depth are measured for shelving, but we need the weight and thickness for mailing them (also from Andrew)

The main fields they use are:

  • Author
  • Title
  • Subject keywords
  • Date for sorting
  • LC Classification

Several speakers, both from the podium and the floor, were pinning their hopes on full-text search. I presume that is because they haven’t tried implementing it. I appreciate the optimism, but full-text is Muggle Technology, not magic. Full-text is great for finding the next 20% or 30% of stuff, but most of your good results come from great metadata (including links and attention data). As Dan Clancy (Google Book and Google Scholar) pointed out, book search is much harder than web search precisely because you don’t have as much link data (metadata). No one had any good ideas about how to get access to all that text so it could be indexed. Well, ideas besides Google Book.

Hey, why wasn’t Brewster Kahle invited? Maybe the LoC already knows what he thinks, but a position paper would be handy for the rest of us.

On-line access to content is working OK. The only complaints were about the URL fields in library catalogs. If you don’t know what MARC is, take a moment to look over MARC 856, Electronic Location and Access. It’s a little more complicated than the <a> tag.

The day started with an interesting and dangerous talk by Timothy Burke on the wonders and difficulties of serious research using our current tools (see Karen’s notes). It was mostly about searching techniques, though it wasn’t really explained that way. I would have been happier if he’d started with some terminology from Marcia Bates. The personal view was helpful, but this should be well-understood stuff by now.

The danger is aiming our tool efforts primarily at the expert user. That way lies disaster. There is really only one way to do this and succeed, and that is to follow the Rob Pike architectural rules:

  1. Simple things are simple.
  2. Hard things are possible.
  3. You don’t have to understand the whole system to use part of it.

Once you do this, the fancy tools can be built on top of it. If you design for the fancy stuff, the system will never be simple and it will probably be over-fit to an old problem (like MARC is today).

One other point from Burke’s presentation: universities no longer teach how to do literature search. Each discipline has general techniques and domain-specific ones (think chemical structure search), and this cannot be fobbed off on some other department. Striking out on your own might help avoid the prejudices of the field, but it can also mean missing and reinventing a lot of stuff.

I also saw some premature target lock-on. For example, converting subject headings to strings of standalone “subject keywords” is a lot of work, and is primarily useful for faceted browsing. Faceted browsing is good, but it is only one approach. We may be using facets because they are the best we can do with the HTML-based web apps of the past five years. Is it right for five years from now, when the conversion is done, or did we just blow a wad of cash on another dead technology?

Finally, I should have asked Andrew Pace how much NC State spent on Endeca.

A side note — Google did a poor job of hosting this event. We had to park a half-mile away, there were no power strips for laptops, I couldn’t get back on the GoogleGuest net after 10AM, we had a “mini kitchen” instead of the usual wide array of free munchies (dang!), and lunch was “here’s a map of the area”. No one stood up to say “let me know if there are any problems”. A few people got power by unplugging the massage chair. Worst of all, the committee was ushered off to the Google Cafeteria, so there was no way to talk with any of them over lunch. Why have an open meeting if you aren’t going to eat together? That was golden time with users, and it was squandered.

A Fatter Long Tail? Nope.

Chris Anderson posted a really nice illustration a couple of months ago in Visualizing the Rise of the Long Tail. He shows three photos of mountain ranges that cover the same peaks but are viewed from different distances.

This would be a nice metaphor, but the underlying idea is wrong. On the other hand, the pictures are really pretty, so let’s take a look.

I’ve got a few Canon lens brochures somewhere in the garage with similar series of shots demonstrating the perspective of wide angle, normal, and telephoto lenses. Hmm, looking more closely these aren’t different perspectives at all, they are a cheap PhotoShop hack — they just stretch the aspect ratio and re-crop a single photo. Hmm, sounds more like PowerPoint-mind than real perspective, eh?

Anyway, here they are:

The idea behind this visualization is that the long, thin tail is going to get thicker (but stay long, I guess).

Unfortunately, the tail cannot simultaneously be a Zipf distribution (“long”) and be fat. The basic point of that shape is that the tail is both long and thin. If it gets fat it isn’t Zipf anymore. Since every popularity-based empirical distribution is Zipf, betting on that to change is a good way to lose money. The long tail is not going to rise until human nature changes.
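
To put a rough formula on “long and thin”: in a Zipf distribution, the traffic to the item at rank r falls off as a power law, roughly

    traffic(r) ≈ C / r^s, with s close to 1

so the item at rank 1,000 gets on the order of a thousandth of the traffic of the item at rank 1, and going further out only makes the tail thinner. A fat tail would mean the power law no longer holds.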

Since that visualization is guaranteed to be wrong for overall web traffic, can it be reused for some other idea? It is a neat visual after all. I hate to waste those.

Instead, think of the top one as aggregate traffic and the bottom one as an individual’s traffic. The individual may spend a lot of time with local interests, things that aren’t ever going to be a big hit (the fat head of the curve). The middle illustration shows groups of people. The students at Rice University will have some things in common that are not shared by the general public (or even faculty), so they share some “foothill” interests. This probably represents both local (Montrose-area Houston) and preference (ironic humor) interests. Pity, that.

When you add together traffic from related individuals, you see the group interests. When you add together traffic from a random sample (or lots of groups), you see the mainstream, the now-familiar long tail.

Really, all of these are different Zipf distributions, but I think the mountains are pretty, so I’ll stick with those. Maybe the groups should focus on different peaks.

A website tailored to a group should be mainly thinking about the bottom two illustrations. Your visitors will spend a lot of time in the foothills, and that is the ground you need to cover in detail. Community libraries have known this for a long time — their acquisition policy is tuned to local tastes. Paying attention to your customers is great, but don’t pretend that Zipf will bend to your business model.

The End of Open Spider Standards?

Yahoo and Microsoft have signed on to Google’s sitemap.xml format and published it at sitemaps.org. Two weeks ago, Yahoo announced support for wildcards in robots.txt, which seems to be similar to Google’s (non-standard) robots.txt extensions.

It is well past time for updates to these, but it is sad that there was no attempt to include anyone outside of the Big Guys and that there is no invitation for anyone else to contribute. Robotstxt.org is still there, as is the ROBOTS mailing list, but both were bypassed.

The sitemap protocol is published under a Creative Commons license, but there is no mailing list, no wiki, not even a feedback e-mail address on the website. Questions are referred back to each individual search engine.

This is both sad and foolish. The big three are not the only bots in the world, and more eyes make a better spec. Submit this to the IETF, OK? It isn’t that hard, and the spec will be much, much better afterwards. Look at the improvement from RSS 2.0 to Atom (RFC 4287). It is like night and day.

Search Transparency and Trust

One way to increase users’ trust in your search engine is to give hints about how it works. When a search engine doesn’t work, the wrong results can be mysterious. That mystery leads to mistrust and to some interesting folklore about search engine algorithms.

Why is Australian radio associated with Disney? Well, because the engine thought that the Australian Broadcasting Corporation (ABC) was the same as the ABC that is part of Disney. With no explanation, that looks stupid, but with “ABC” highlighted, it is a reasonable mistake. That extra information makes the search engine more trusted.

Snippets: These days, we expect search engines to show passages from the matched documents and to highlight the matching words. Why is that important? Because it shows what the engine matched in that document and helps explain why it appears in the results. It eliminates the mystery so the user can say, “Oh you silly engine, that is the wrong ABC!”

Because you liked: At Netflix, the recommendations are introduced with an explanation. For our account, it looks like this today, “The following movies were chosen based on your interest in: Man with the Movie Camera, Gladiator, Harvey.” Without that hint, I would be genuinely confused by recommendations including Steamboat Bill, Jr. and The Last Samurai.

Group by Topic: Showing related topics is helpful, but a topic name is usually not enough information for people to trust the link. Instead, show the topic and the first two or three documents in that topic. This is especially useful when the user’s query doesn’t match up with the way the topics are organized. A search for “linux” could match press releases, products, knowledge base, and so on. Show the first few matches in each of those areas and the contents are much more clear.

Google, Yahoo, and MSN cannot reveal their algorithms, but you can (unless you use Google for your site search, oops). The WWW engines must defend against spammers taking advantage of loopholes in their scores. If you own your own content and your own search engine, you can reveal as much as you want. Just don’t make it all about the engine, the users are there for the content.

Google Blog Search Catches Up, Sort Of

Google Blog Search adds support for blog ping only two years after it shipped in Ultraseek (in version 5.3, September 2004). Want to bet the Google Appliance still doesn’t have it?

It wasn’t hard to implement, either. I put it together for Ultraseek in a couple of days. Clearly, most of their people are working on ads, not search. After all, ads make money and search costs money.

Nukers and Shills

Some fun new terminology in the “let’s spam the recommender” business. Paul Lamere of Sun Labs reports on a talk by Bamshad Mobasher at Recommenders ’06 about attacks on recommendation systems.

[Dr. Bamshad Mobasher] outlined two basic types of attacks: shilling (trying to promote an item) and nuking (trying to demote an item). These types of attacks are quite real. The social site Digg is under constant attack by shills trying to get their story promoted to the front page.

Bamshad pointed to an example when a loosely organized group who didn’t like evangelist Pat Robertson managed to trick the Amazon recommender into linking his book “Six Steps to a Spiritual Life” with a book on anal sex for men.

Bamshad suggests that one way to defend against shills and nukes is to create hybrid recommenders – recommenders that use not only social data but also some inherent measure such as text or acoustic similarity. These types of systems are typically more robust than pure social recommenders.

This isn’t a new thing. Around 1998 at Infoseek someone was trying to sell us a system that used auto-categorization with discussion groups. As soon as they opened it up, some hostile users hijacked a Christian group to be in the Satanism category (or maybe Wicca, it was a long time ago).

At the start of this post I called recommender spam a business. I don’t know if anyone is making money on it yet, but it is one more opportunity for spammers. Your product gets a higher rating, they get paid. If you notice that sites suddenly require a login or even a subscription to post ratings, thank the spammers.

Good to Great Search

I was reviewing a sample chapter from Lou Rosenfeld and Rich Wiggins’ upcoming book on search log analysis. This chapter covers Michigan State University’s steps in patching around an aging AltaVista engine. It is good history, but not very good advice. MSU’s first step was to build a manual Best Bets system to match individual queries to editorially chosen URLs.

Best Bets are very effective, but are usually a last resort, not the first. The strength of Best Bets is that the results are very, very good. The weakness of Best Bets is that the manual effort only improves the results for a single query. That had better be an important query! Most other kinds of tuning help all queries or at least a broad set, perhaps all results from one website or one web page template.

Here is what I suggest for improving your search:

  1. Get a better search engine. This will help all queries, even the ones you don’t measure. If you don’t already have a metric for “better”, use the relevance measure from step 4 combined with the required number of documents and query rate.
  2. Look at the top few hundred queries and record the rank of the first relevant result.
  3. For each query without a good hit in the top three (“above the fold”), find one or more documents (URLs) which would be good results.
  4. If you want a single number for goodness, use the ranks from step 2 to calculate MRR (mean reciprocal rank). Invert each rank number and average them (a sketch of the calculation follows this list). You’ll get a number between 0 and 1, where “1” means the first hit was relevant every time. If you are getting above 0.5, your engine is doing a pretty good job — you’re averaging a good result in the second position. You need at least 200 queries for MRR measurements to be statistically valid.
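
Here is a minimal sketch of that calculation, assuming you record the rank of the first relevant result for each query and use 0 when there was no relevant result.

// Sketch of the MRR calculation from step 4. Ranks are 1-based; a rank of 0
// means no relevant result was found and contributes nothing to the sum.
class MeanReciprocalRank {
    static double mrr(int[] firstRelevantRanks) {
        double sum = 0.0;
        for (int rank : firstRelevantRanks) {
            if (rank > 0) {
                sum += 1.0 / rank;   // 1.0 for position 1, 0.5 for position 2, ...
            }
        }
        return sum / firstRelevantRanks.length;
    }
}

For example, ranks of 1, 2, and 4 give (1.0 + 0.5 + 0.25) / 3, or about 0.58.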

Now you have a list of failed queries matched with good documents. Start at the top of that list, and try the following actions for each one. When one of your preferred documents is ranked above the fold, you are done with that query and should move on to the next query in your list.

  1. Are the preferred documents in the index at all? If not, get them in and recheck the ranking.
  2. Are the documents ranked above the preferred ones good quality or junk? If they are unlikely to be a good answer for a reasonable query, get them out of the index and recheck the ranking.
  3. Are the preferred documents valid HTML? Do they depend heavily on IFrames, JavaScript, Flash, or other too-clever features? Fix them to comply with ADA and Section 508 (it’s the law!), reindex, and recheck.
  4. Do the preferred documents have good titles (the <title> tag in HTML)? If not, fix that, reindex, and check the ranking.
  5. Take a critical look at the preferred documents and decide whether they really answer the query. If they don’t, add a page which does answer it. Index that page and recheck the ranking.
  6. Do the documents include lots of chrome, navigation, and other stuff which swamp the main content? If so, configure your search engine to selectively index the page (Ultraseek Page Expert) or use engine-specific markup for selective indexing in the page templates. Reindex and check the ranking.
  7. Do the terms in the preferred documents match the query? The query is “jobs” but the page says “careers”? If so, consider adding the keywords meta tag or synonym support in your engine (or go to the next step). Reindex and check the ranking.
  8. Add a manual Best Bet for this specific query pointing to the well-formatted, well-written document with the answer. Schedule a recheck in six months to catch site redesigns, hostname changes, etc. and hope that it doesn’t go stale before then.

As you go through this process, you’ll find entire sites which are not indexed, have bad HTML, are heavy with nav and chrome, or are designed so that they just don’t answer queries (click for the next paragraph). Fixing those will tend to improve lots of things: WWW search rankings, web caching, accessibility, and bookmarkability.

Search matches questions to answers. It is really hard to improve the quality of the questions (get smarter customers?), and the matching algorithms are subtle and tweaky, so don’t be surprised when most of your time is spent improving the quality of the answers.

Stupid Stemmer Tricks

A “stemmer” is software that returns a “stem” for a word, usually removing inflections like plural or past tense. The people writing stemmers seem to think they are finished when the mapping is linguistically sensible, but that leaves plenty of room for dumb behavior. Just because it is in the dictionary doesn’t mean it is the right answer.

Here are some legal, but not very useful things that our stemmers have done for us:

  • “US” to “we” (wrong answer for “US Mail”)
  • “best.com” to “good.com” (oops, don’t run URLs through the stemmer)
  • “number” to “numb” (correct, but when is the last time you meant “more numb”?)
  • “tracking meeting” to “track meet” (gerund to verb that can also be a noun, bleah)

The stemmer people say “use part of speech tagging”, but we need to do exactly the same transformations to the documents and to the queries. Queries rarely have enough text for the tagger to work.

A search-tuned stemmer would be really nice to have. I’ve got some ideas: leave gerunds alone, don’t treat comparatives and superlatives as inflections, and prefer noun-to-noun mapping. It would need to be relevance-tested with real queries against a real corpus, of course.
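
As a rough illustration of those rules, here is a sketch of a wrapper around an existing stemmer. The Stemmer interface and the exception list are hypothetical, and a real list would come out of relevance testing.

import java.util.Set;

// Sketch of a search-tuned stemmer wrapper. The Stemmer interface and the
// exception set are hypothetical stand-ins; build the real list from relevance tests.
class SearchTunedStemmer {
    interface Stemmer {
        String stem(String word);
    }

    private final Stemmer base;
    // Words whose dictionary stem is legal but hurts search, so leave them alone.
    private final Set<String> doNotStem = Set.of("us", "number", "best", "meeting");

    SearchTunedStemmer(Stemmer base) {
        this.base = base;
    }

    String stem(String word) {
        String lower = word.toLowerCase();
        if (lower.contains(".") || doNotStem.contains(lower)) {
            return word;               // don't stem URLs, hostnames, or known-bad cases
        }
        if (lower.endsWith("ing")) {
            return word;               // leave gerunds alone ("tracking meeting" stays put)
        }
        return base.stem(word);
    }
}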

Articles on ultraseek.com

I write occasional articles on search and Ultraseek for ultraseek.com, so I’ve collected links to them all from here. I’ve clipped the first paragraph or two of each one so you can see whether you’d like to read the entire article.

Relevance and User Satisfaction

Search relevance is usually thought of as a statistic that measures whether the search results match the query. That is useful in the lab, but not as useful for a search installation.

When search is part of a site, we need to understand how it helps the users of that site. Can they find things quickly? Are they comfortable with the search?

My Favorite Customer Problem

We are concerned about all of the problems reported by our customers, but there is one problem I don’t mind hearing about.

Keeping Your Index Fresh with Add URL

Everyone wants their search index to be an accurate, timely reflection of their content. Ultraseek automatically revisits pages to find new URLs, and that is very effective, but some sites have even stronger requirements for how quickly documents need to be available in search results. This is called “index freshness.” A stale index frequently misses new pages and holds old information: pages that have already been deleted, and old copies of pages that have changed since they were indexed. For maximum index freshness, use Ultraseek’s Add URL feature for notifications of deleted, changed, or new URLs.

Don’t Reindex Every Week!

If you have used other search engines, you probably had to manually configure your indexing schedule to make sure new content was found and indexed. This is not necessary with Ultraseek.

Ultraseek has “continuous spidering with real-time indexing and adaptive revisit intervals.” It sounds complicated, but it means that Ultraseek will automatically spider most pages at the right times.

Why Not Use “All Terms” Queries?

Google, Yahoo!, and MSN all default to matching all of your search terms, but Ultraseek does not. Why? What do you say when your users want Ultraseek to “work like Google”?

In most cases, it is good for an enterprise engine to behave like the WWW engines because users can intuitively transfer their searching skills. But, this is a case where doing the right thing is more expensive for the WWW engines, and more reasonable for enterprises.

It looks like I’ve been careful to write useful leads, because selecting the first few sentences makes a pretty fair intro to each article.

A Different Approach to On-line Text

Maybe this is more amusing to people who have worked on search indexes, but I thought it was a worthwhile use of computer resources. Check out Starship Titanic: The Novel! Click through all the intro pages; that is part of the fun. One of the index pages has a dead link, but there is remarkably little linkrot for something put on the web in 1997.

Don’t miss the colophon and contest page.

Query Box as Extensible UI

Yahoo’s Open Shortcuts is a nice simple extension to search implemented entirely within the query box. We’ve been able to do queries like wikipedia emma bull or zip code palo alto or weather baton rouge for a while, using the first word as a context command. Some of those (“wikipedia”) bias the results, while others display custom information above the results.

Open Shortcuts adds the ability to punch through to a different search engine (like !ebay dagor) and also to define your own contexts that go to your favorite spots.

It is easy to define a shortcut for any Ultraseek search engine. Here’s how to define a shortcut for ultraseek.com.

  1. Go to Create a Shortcut.
  2. Scroll down to the Search a Site section.
  3. Enter a shortcut name, like ultra.
  4. Enter this for the URL: http://search.ultraseek.com/query.html?qt=%s
  5. Click Set, then OK
  6. Try a search, like !ultra all terms and enjoy results direct from the Ultraseek site.

You can use this pattern for most Ultraseek-powered sites. Do a search to find the host with the search engine, then use that host with /query.html?qt=%s. Some sites may have customized pages with a different path. Check your browser’s location bar to make sure.
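
If you are generating these links yourself, the shortcut URL is just the query substituted into that pattern. Here is a minimal sketch using the ultraseek.com host as an example; a customized site may use a different page.

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch of building an Ultraseek query URL by hand. The host and path are
// the ultraseek.com defaults; check the site's location bar for custom paths.
class UltraseekQueryUrl {
    static String build(String host, String query) {
        return "http://" + host + "/query.html?qt="
            + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Prints http://search.ultraseek.com/query.html?qt=all+terms
        System.out.println(build("search.ultraseek.com", "all terms"));
    }
}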

Nice work, Yahoo, and much simpler than A9’s OpenSearch.

Destructive Spidering

Reading The Spider of Doom on The Daily WTF reminded me of a similar story with Ultraseek from years ago, though ours had a happier ending.

Back in 1998 or 1999, we got a call from a customer asking if it was possible that the spider could delete pages from Lotus Domino. 10,000 pages had disappeared overnight, and the only thing they could think of was the evaluation copy of Ultraseek. After looking at the access logs, we figured out that they had a link to “delete this page” on every page. Also, they’d logged the spider in as Admin so that it could access everything. Oops!

I said there was a happy ending? They restored everything from backups, figured out that a link (GET) was a bad idea and changed it to a button (POST), and they bought Ultraseek because they knew that it could access every single page. On our end, we added a rule to never, ever follow that link on Lotus Domino. We all solved our problems and learned something, too.