Measuring Search Relevance with MRR

At Chegg, we test the relevance of our search engine using customer data. We extract anonymous information about queries and clicks, then use that to automatically test improvements to search. When our search engine provides results that better match what our customers are choosing, we call that an improvement.

The most basic measurement of search quality is clickthrough rate. For each search page that is shown, how often is at least one search result clicked on? This fits well with overall website conversion, showing how well search does at getting a visitor to a successful result.

We can take a more critical look at our search page by not counting all clicks equally. If we count less for clicks farther down the search result list, we can evaluate the search ranking. When visitors click on the top hit, count a full click. If they click farther down, count less than a full click. We penalize ourselves for not putting the most-desired item at the top of the list.

We do this with mean reciprocal rank (MRR). We weight each click with the reciprocal of the search result’s rank: rank one (the top result) is scored as 1, rank two as 1/2, rank three as 1/3, rank four as 1/4, and so on. Then we add up those weighted clicks and divide by the total number of clicks to get the mean, hence mean reciprocal rank. You can also think of this as a weighted clickthrough rate, where clicks farther down the result list are given lower scores. MRR will always be equal to or lower than clickthrough rate.

When evaluating our search engine in test, we score the clicks we have collected from the website against the engine’s ranking and divide by the total number of clicks. A perfect score would be 1.0, but only if there were one correct answer for each query.

Though each customer’s search has one right answer, different customers have different right answers. The same course at different universities uses different textbooks, sometimes with the same name. Even if they use the same textbook, they may use different editions. Users who typed “financial accounting” at Chegg last winter clicked on 43 different books. That is a lot more than fit on the first page of ten hits.

Some books were more popular than others, so we can use that information to test our search engine. The most popular book should be the first result, the second most popular the second hit, and so on. Of course, we have to do this without causing problems for any other search result, like queries for “introduction to financial accounting” or “financial and managerial accounting”. Here are the top five textbooks clicked on for “financial accounting” last January, with the number of times each one was clicked on.

Label  Clicks  Book
A      145     Financial Accounting, 8th Edition, Harrison et al
B      130     Financial Accounting, 2nd Edition, Spiceland et al
C      119     Financial Accounting, 7th Edition, Weygandt et al
D      106     Financial Accounting, 6th Edition, Kimmel et al
E       80     Financial Accounting, 7th Edition, Libby et al

Now let’s calculate the MRR for the query “financial accounting”, assuming that the search engine returns the books in the ideal order: ABCDE. We multiply the number of clicks for each rank by the reciprocal rank, add them up, and divide by the total number of clicks. This is the optimal ranking, so it is the highest possible MRR value for this query.

(145 * 1 +
130 * 1/2 +
119 * 1/3 +
106 * 1/4 +
80 * 1/5) / 580 = 0.504

Real search engines are not perfect, so let’s assume our engine shows the second most popular book first, then a non-relevant book, then the rest in order. Using an ‘x’ for the non-relevant book, the results order is BxACDE.

(130 * 1 +
145 * 1/3 +
119 * 1/4 +
106 * 1/5 +
80 * 1/6) / 580 = 0.418
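
This arithmetic is easy to script. Here is a minimal sketch in Python that reproduces both numbers above; the click counts and labels come from the table, but the function name and the ‘x’ placeholder for the non-relevant book are mine for illustration, not Chegg’s actual tooling.

# Minimal sketch of click-weighted MRR for a single query.
# Click counts are the ones from the table above.

def click_weighted_mrr(results, clicks):
    # Weight each collected click by the reciprocal of the rank where its
    # book appears, then divide by the total number of clicks.
    total_clicks = sum(clicks.values())
    score = sum(clicks.get(book, 0) / rank          # non-relevant results score 0
                for rank, book in enumerate(results, start=1))
    return score / total_clicks

clicks = {"A": 145, "B": 130, "C": 119, "D": 106, "E": 80}

print(click_weighted_mrr(["A", "B", "C", "D", "E"], clicks))       # ideal order: ~0.504
print(click_weighted_mrr(["B", "x", "A", "C", "D", "E"], clicks))  # BxACDE: ~0.418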

Now we can build a repeatable test for relevance. We collect a set of queries with click counts. A script runs each query, then scores the returned results against the click data. At the end of the run, it calculates MRR for the whole set. The final MRR is reported as an overall metric, and the MRR for each query is reported for diagnostics. The script also calculates the ideal MRR for the whole set and for each query. This runs quickly; even a few thousand queries take only a few minutes.
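
As a rough sketch of what such a script might look like (not Chegg’s actual code): assume a search(query) function stands in for the engine under test, the collected click data is a dict per query, and the set-level MRR is taken as the mean of the per-query MRRs, which is one reasonable choice of aggregation.

# Sketch of the repeatable relevance test. Assumes search(query) returns an
# ordered list of result ids and queries maps each query string to its
# {result id: click count} data collected from the website.

def mrr(results, clicks):
    total = sum(clicks.values())
    return sum(clicks.get(doc, 0) / rank
               for rank, doc in enumerate(results, start=1)) / total

def ideal_mrr(clicks):
    # Best possible score: most-clicked result first, and so on down the list.
    return mrr(sorted(clicks, key=clicks.get, reverse=True), clicks)

def run_relevance_test(queries, search):
    per_query = {q: (mrr(search(q), clicks), ideal_mrr(clicks))
                 for q, clicks in queries.items()}
    overall = sum(actual for actual, _ in per_query.values()) / len(per_query)
    ideal = sum(best for _, best in per_query.values()) / len(per_query)
    return overall, ideal, per_query   # overall metric plus per-query diagnostics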

The first time we run the test, the result is saved as the baseline MRR. When we change the search algorithm, we can test to see if it is better or worse than the baseline MRR. When the search algorithm is improved, we set a new, higher baseline MRR.

We can also look at individual queries where the MRR is much worse than ideal to find bugs in the search engine. When we first got our MRR testing working, I found several bugs that were easy to fix. For example, we added synonyms for “intro” and “introduction” and also made “&” synonymous with “and”.

MRR is not the only relevance metric. It is best suited when there is one correct answer. At Chegg, there is one correct textbook for each person’s search. This is a known-item search, or navigational search. Other metrics are better for informational searches, where partial information is gathered from multiple results. If I am planning a long weekend hike in the Sierras, the metric needs to count more than one search click, because I am gathering information from many sources. Some evaluation metrics for informational searches are nDCG, RBP, and ERR.
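
As an illustration of how those metrics differ from MRR, here is a small nDCG sketch. It assumes graded relevance judgments for each returned result rather than a single correct answer, which is exactly the extra information an informational search needs. The numbers are hypothetical, and this is not part of the Chegg test described above.

from math import log2

def dcg(relevances):
    # Graded relevance, discounted by the log of the rank.
    return sum(rel / log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (best-first) ordering.
    best = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / best if best else 0.0

print(ndcg([3, 2, 3, 0, 1]))   # hypothetical judgments for the top five results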

Regardless of which metric you choose, start gathering search and click data right now. I was surprised at how much data I needed to improve individual search queries. Twenty million clicks is a good start!

Originally published on the Chegg tech blog in December 2012.

Ultraseek vs. Google Search Appliance

On the occasion of the Googlebox end-of-life news, it is time to talk about what a weak product it really was.

Sandia Labs was an Ultraseek customer and ran a relevance experiment where Ultraseek trounced the Google Search Appliance. But first some history.

Many of the US national laboratories used Ultraseek. I don’t remember how it started, but I was invited to give talks about search at two of their IT conferences, one at Oak Ridge National Laboratory (in an auditorium named after Nobel laureate Eugene Wigner) and one at SLAC, the Stanford Linear Accelerator Center (home of the first website outside Europe).

The labs were quite happy with Ultraseek, but at Sandia, the search team was asked to evaluate the Google Search Appliance. Like good scientists, they set up an experiment. They formatted the results from the two engines in similar, anonymous styles. They set up a table in the cafeteria, offering free cookies for people to try searches and choose the best results from the two engines.

This is a simple but very effective evaluation technique. I like it because it judges the whole result set, both ranking and presentation. It isn’t good for diagnostics, but it is great for customer satisfaction. I call this approach “Kitten War”.

Ultraseek won the experiment, 75% to 25%. That is a three-to-one preference. I’ve never seen that magnitude in a search experiment. In search, we break out the champagne when we get a one percentage point improvement in clickthrough. I’m not kidding. This is beyond massive.

Whoever was pushing Google at Sandia asked them to re-run the experiment with the logos. With that change, Google won 55% to 45%.

Also, performance? Ultraseek was spec’ed for 15 queries/sec and surpassed that spec. The first release of the Googlebox was spec’ed at 30 queries/min, thirty times slower. They later increased that to 60 queries/min. That is one query per second.

Ultraseek actually ran at around 25+ qps, though some new features dropped us closer to 15 qps.

We were the public search engine for irs.gov through Andersen Consulting. Instead of quoting the specs, Andersen promised the throughput they had measured, then complained to us. They were massive a-holes about it, even after I made it very clear that it was their fault. But we made Ultraseek even faster, because who wants the IRS search to be slow? irs.gov ran a cluster of fifteen Ultraseek servers. I would not want to try to hit that rate with Googleboxes.

Sadly, the relevance test was the point when Ultraseek should have just given away the source code and gone home. The Google logo was enough to sell a massively inferior product. There was nothing we could do in engineering, sales, whatever, to compete with the Google logo.

Sandia Labs did stay with Ultraseek and we continued on for a number of years, but the writing was on the wall.

What is like Dersu Uzala?

Tom Mangan recommended Dersu Uzala, so I added it to our Netflix Queue.

Funny, Netflix isn’t quite sure what other movies are like Dersu Uzala. I don’t really blame them — what is like a masterpiece?

I guess the list illuminates aspects of the film. These are the films it showed after I added Dersu Uzala to our queue:

Kurosawa, Tarkovsky, Fritz Lang, and Burt Reynolds, no place but Netflix.

Most Popular Netflix Movie in Palo Alto

Kinda recursive, but true:

[Netflix screenshot: “Palo Alto, CA”]

Check out Palo Alto, CA at Netflix or at the official site. The writer, director, and producer are 2005 graduates of Palo Alto High School and the movie was shot on location. Big surprise. Here is a story about it from their high school newspaper, The Paly Voice.

Palo Alto hasn’t been this famous since The Donnas and Chelsea Clinton.

Check out your own local favorites at Netflix.

Query Box as Confessional Box

We have a very small number of very long queries that are timing out in the search engine, so I was digging through the logs looking at long queries. I found this.

“something that i think will never happen just did and i dont like it one single bit, no not even a drop, and i wish it never did because it just ruined my life and i just want to watch indianna jones”

199 characters. I think I’ll set the limit somewhere over 200 characters. I’d hate to make their day worse.

Search Evaluation by Kitten War

On a search engine mailing list, the topic of simple A/B testing between search engines came up. This can be between different implementations, different tunings, or different UI presentations. The key thing is that users are offered two alternatives and asked which one they like better. One bit of information, this one or that one. If you’ve been to the Kitten War site, you’ll understand why I call it “kitten war testing”. Others may call it a “beauty contest”. They are wrong, of course.

During the years I worked on Ultraseek, surprisingly few customers had the spare time to run serious tests. One national laboratory ran tests as part of their evaluation and later ran larger tests on their intranet design. Another ran regular tests on all changes to their intranet search, presentation or ranking. These were the exception. We had at least 10,000 customers over nine years and only a handful ran serious tests.

Where I work now, we have a few million queries per day, so we can easily put a few tens of thousands of users into test and control cells. We do that for all changes, religiously. Most people don’t have that luxury, but you can run a kitten war test and rise above the superstitious masses on a wave of real data.

Kitten war testing can be very effective, but it is very, very easy to mess it up. Here are some things to watch out for.

Cuteness Counts: Just like with kittens, the prettiest search results page will almost always win. If there are two engines, they must be configured to look as similar as possible. Really. It is OK if they are both ugly, just make them the same. Double-blind is even better; at a minimum, the cuteness judge must not know which one is which.

Longhairs are Cuter: Watch out for visible differences which are not essential to the engines. The length of summaries is one of those. On the other hand, some differences may be intrinsic to the different engines. For a while, I have felt that Yahoo’s snippets are slightly better than Google’s. Snippets are really hard and a reasonable part of a kitten war test.

Google is Cuter: Or, “brand names work”. One of our Ultraseek customers did a blind kitten war test between Ultraseek and Google. Ultraseek was preferred 75% of the time. Some executive found this hard to believe and asked that they try it again with the logos attached to the pages. The second time, people preferred Google over half the time.

5% Doesn’t Matter: Unless you can get lots of data points, you’ll have very noisy results. 75% is a strong result, but a 48/52 split is not good enough. We run on-line tests with less than 1% difference, but it takes about 100K samples for those numbers to settle down. Find someone who got A’s in statistics and ask them to help (or see the quick sketch after this list).

Will Search for Food: If you can’t get a thousand searches, run it as a qualitative test. Set up a table in the lunch room, hand out cookies, and ask people to run a few pre-determined searches and the last few things they personally looked for on your intranet. Talk to them about why they like one or the other. Ask them what they expected to find. Have an observer take notes, lots of notes. You should still shoot for more than 50 users. 200 would be better. Could be a long week in the lunch room.

Cute Overload: Allow for “can’t decide”. Sometimes, both kittens are equally cute.
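
Here is the quick sketch promised under “5% Doesn’t Matter”: a back-of-the-envelope significance check using a normal approximation, which shows why a 48/52 split from a lunchtime sample means nothing while the same split over 100K samples is overwhelming. The vote counts below are made up for illustration.

from math import sqrt, erf

def p_value_vs_coin_flip(wins, total):
    # Two-sided test of whether a wins/total preference split differs from 50/50.
    z = (wins / total - 0.5) / sqrt(0.25 / total)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

print(p_value_vs_coin_flip(39, 75))        # 52/48 on 75 votes: p ~ 0.73, pure noise
print(p_value_vs_coin_flip(52000, 100000)) # 52/48 on 100K votes: p effectively 0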

Let’s take a break and recommend a couple of slightly more serious posts on A/B testing:

Fundamentally, Kitten War testing gets close to the truth — which engine makes your users happiest. You might argue for the shortest task completion time, but happy users are a very fine thing.

Kitten war testing is not usability testing. If you are trying to improve usability, do real usability testing. That isn’t really harder than kitten war testing, but it is a different kind of test. For a quick intro, see Jakob Nielsen on usability testing with five users.

Odd Cataloging Decisions at Palo Alto Library

I really wonder about the cataloging at my local library. I was looking for books by Jo Walton and I noticed that a series by her was spread across two areas, both arguably wrong. First, Ha’penny is a sequel to Farthing, so they really should be shelved in the same section. Second, they are both alternate history novels from a fantasy author, and I wouldn’t look in either Mystery or Fiction for them.

Check out this screenshot from their search on July 2nd.

[Screenshot: Palo Alto Library web catalog results]

Big hint, Tor has been a major SF&F imprint for over 25 years.

I’m looking forward to Palo Alto’s choice for Half a Crown, the next book in the series. Maybe DDC 737 (Numismatics)?

I reported this to the reference desk at Main. Let’s hope they fix it.

The fun doesn’t stop there. I’m currently reading The Fall of the Kings. That was shelved in YA Fiction, where it doesn’t even belong. I read a fair mix of books, from Westerfeld to Dostoevsky, with plenty of YA, and this just doesn’t fit in the Teen collection. It is long (476 pages of small print), there are no teenage characters, nearly every chapter has sex and/or violence, it is quite slow moving, and it helps if you care about university politics. I read Valiant immediately before, and that book has half the word count with double the action and four times the dialogue, plus teens, fairies, drugs, NYC, and a massive betrayal by mom. Valiant belongs in the Teen section. Dreamhunter belongs there. The Fall of the Kings does not.

I thought that maybe, just maybe, they put it in YA because the most recent book in the series, The Privilege of the Sword, has a 15 year old girl as the main character and can easily be considered YA, so they decided to keep them together. Sorry, they shelved that one in Science Fiction.

I know that strictly defining Science Fiction (or Fantasy) is nearly impossible, but they must be able to avoid howlers like this. Yes, Michael Chabon has written fantasy (Summerland) and SF (The Yiddish Policemen’s Union) but it might as well be shelved in the mainstream section because that is where people will look for him. On the other hand, Jo Walton has written a sword and sorcery trilogy and a book set in Victorian England where the nobility are dragons. Where would you look? Heck, ask Jo Walton. Her answer to the FAQ “What genre is Farthing?” reads “It’s an alternate history mystery. I think that makes it SF.”

Hmm, Palo Alto also shelves The Lord of the Rings and Jacqueline Carey’s Kushiel series in mainstream Fiction. Bizarre. The Kushiel books are also published by Tor. Can we just shelve all the Tor in SF, as a stopgap?