At Chegg, we test the relevance of our search engine using customer data. We extract anonymous information about queries and clicks, then use that to automatically test improvements to search. When our search engine provides results that better match what our customers are choosing, we call that an improvement.
The most basic measurement of search quality is clickthrough rate. For each search page that is shown, how often is at least one search result clicked on? This fits well with overall website conversion, showing how search works in getting a visitor to a successful result.
We can take a more critical look at our search page by not counting all clicks equally. If we count less for clicks farther down the search result list, we can evaluate the search ranking. When visitors click on the top hit, count a full click. If they click farther down, count less than a full click. We penalize ourselves for not putting the most-desired item at the top of the list.
We do this with mean reciprocal rank (MRR). We weight each click with the reciprocal of the search result’s rank. Rank one (the top result) is scored as 1, rank 2 is scored as 1/2, rank three as 1/3, rank 4 as 1/4, and so on. Then we add those reciprocal ranks up and divide by the total to get the mean, thus mean reciprocal rank. You can also think of this as a weighted clickthrough, where clicks farther down the result list are given lower scores. MRR will always be equal to or lower than clickthrough rate.
When evaluating our search engine in test, we use clicks we have collected from the website and divide by the total number of clicks. A perfect score would be 1.0, but only if there was one correct answer for each query.
Though each customer’s search has one right answer, different customers have different right answers. The same course at different universities uses different textbooks, sometimes with the same name. Even if they use the same textbook, they may use different editions. Users who typed “financial accounting” at Chegg last winter clicked on 43 different books. That is a lot more than fit in the first page of ten hits.
Some books were more popular than others, so we can use that information to test our search engine. The most popular book should be the first result, the second most popular the second hit, and so on. Of course, we have to do this without causing problems for any other search result, like queries for “introduction to financial accounting” or “financial and managerial accounting”. Here are the top five textbooks clicked on for “financial accounting” last January, with the number of times each one was clicked on.
||Financial Accouting, 8th Edition, Harrison et al
||Financial Accouting, 2nd Edition, Spiceland et al
||Financial Accouting, 7th Edition, Weygandt et al
||Financial Accouting, 6th Edition, Kimmel et al
||Financial Accouting, 7th Edition, Libby et al
Now let’s calculate the MRR for the query “financial accouting”, assuming that the search engine returns the books in the ideal order: ABCDE. We multiply the number of clicks for each rank by the reciprocal rank, add them up, and divide by the total number of clicks. This is the optimal ranking, so it is the highest possible MRR value for this query.
(145 * 1 +
130 * 1/2 +
119 * 1/3 +
106 * 1/4 +
80 * 1/5) / 580 = 0.504
Real search engines are not perfect, so let’s assume our engine shows the second most popular book first, then a non-relevant book, then the rest in order. Using an ‘x’ for the non-relevant book, the results order is BxACDE.
(130 * 1 +
145 * 1/3 +
119 * 1/4 +
106 * 1/5 +
80 * 1/6) / 580 = 0.418
Now we can build a repeatable test for relevance. We collect a set of queries with click counts. A script runs each query, then scores the returned results against the click data. At the end of the run, it calculates MRR for the whole set. The final MRR is reported as an overall metric, and the MRR for each query is reported for diagnostics. The script also calculates the ideal MRR for the whole set and for each query. This can run quickly, even a few thousand queries only takes a few minutes.
The first time we run the test, the result is saved as the baseline MRR. When we change the search algorithm, we can test to see if it is better or worse than the baseline MRR. When the search algorithm is improved, we set a new, higher baseline MRR.
We can also look at individual queries where the MRR is much worse than ideal to find bugs in the search engine. When we first got our MRR testing working, I found several bugs that were easy to fix. For example, we added synonyms for “intro” and “introduction” and also made “&” synonymous with “and”.
MRR is not the only relevance metric. It is best suited when there is one correct answer. At Chegg, there is one correct textbook for each person’s search. This is a known item search or navigational search. Other metrics are better for informational searches, where partial information is gathered from multiple results. If I am planning a long weekend hike in the Sierras, the metric needs to count more than one search click, because I am gathering information from many sources. Some evaluation metrics for informational searches are nDCG, RBP, and ERR.
Regardless of which metric you choose, start gathering search and click data right now. I was surprised at how much data I needed to improve individual search queries. Twenty million clicks is a good start!
Originally published on the Chegg tech blog in December 2012.