On a search engine mailing list, the topic of simple A/B testing between search engines came up. This can be between different implementations, different tunings, or different UI presentations. The key thing is that users are offered two alternatives and asked which one they like better. One bit of information: this one or that one. If you’ve been to the Kitten War site, you’ll understand why I call it “kitten war testing”. Others may call it a “beauty contest”. They are wrong, of course.
During the years I worked on Ultraseek, surprisingly few customers had the spare time to run serious tests. One national laboratory ran tests as part of their evaluation and later ran larger tests on their intranet design. Another ran regular tests on all changes to their intranet search, presentation or ranking. These were the exception. We had at least 10,000 customers over nine years and only a handful ran serious tests.
Where I work now, we have a few million queries per day, so we can easily put a few tens of thousands of users into test and control cells. We do that for all changes, religiously. Most people don’t have that luxury, but you can run a kitten war test and rise above the superstitious masses on a wave of real data.
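If you’re setting up cells like that, the one mechanical detail worth getting right is sticky assignment: the same user should land in the same cell on every visit, or your results will be mush. Here’s a minimal sketch of the usual hashing trick, assuming you have some stable user ID; the function name and experiment label are made up for illustration:

```python
import hashlib

def assign_cell(user_id: str, experiment: str, test_fraction: float = 0.5) -> str:
    """Deterministically assign a user to 'test' or 'control'.

    Hashing the user ID together with the experiment name gives every
    user a stable bucket, so the same person always sees the same engine,
    and different experiments don't reuse the same split of users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "test" if bucket < test_fraction else "control"

# Example: route 10% of users to the hypothetical new ranker.
print(assign_cell("user-12345", "ranker-v2", test_fraction=0.10))
```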
Kitten war testing can be very effective, but it is very, very easy to mess it up. Here are some things to watch out for.
Cuteness Counts: Just like with kittens, the prettiest search results page will almost always win. If there are two engines, they must be configured to look as similar as possible. Really. It is OK if they are both ugly, just make them the same. At a minimum the test should be blind: the cuteness judge must not know which one is which. Double-blind, where the person running the test doesn’t know either, is even better.
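One way to keep the judge blind is to render both result sets through the same template, label them only “A” and “B”, and randomize which side each engine gets so position can’t give it away. A sketch of that idea, assuming both engines sit behind a common search call; the names here are placeholders, not any real API:

```python
import random

ENGINES = ("engine_x", "engine_y")  # hypothetical engine identifiers

def search(engine: str, query: str) -> list[str]:
    # Stand-in for calling the real engine; returns placeholder titles.
    return [f"{engine} result {i} for {query!r}" for i in range(3)]

def blinded_pair(query: str) -> dict:
    """Serve both result sets under neutral labels, in random order.

    The judge only ever sees "A" and "B"; the key records which engine
    was which, so votes can be decoded later during analysis.
    """
    order = list(ENGINES)
    random.shuffle(order)  # randomize sides so position can't leak identity
    return {
        "query": query,
        "A": search(order[0], query),
        "B": search(order[1], query),
        "key": {"A": order[0], "B": order[1]},  # keep server-side, never show it
    }
```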
Longhairs are Cuter: Watch out for visible differences which are not essential to the engines. The length of summaries is one of those. On the other hand, some differences may be intrinsic to the different engines. I’ve felt for a while that Yahoo’s snippets are slightly better than Google’s. Snippets are really hard to do well, so they are a reasonable part of a kitten war test.
Google is Cuter: Or, “brand names work”. One of our Ultraseek customers did a blind kitten war test between Ultraseek and Google. Ultraseek was preferred 75% of the time. Some executive found this hard to believe and asked that they try it again with the logos attached to the pages. The second time, people preferred Google over half the time.
5% Doesn’t Matter: Unless you can get lots of data points, you’ll have very noisy results. 75% is a strong result, but a 48/52 split proves nothing without thousands of votes. We run online tests where the difference between cells is under 1%, but it takes about 100K samples for those numbers to settle down. Find someone who got A’s in statistics and ask them to help.
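If you want a back-of-the-envelope check before bothering your statistics friend, a simple two-sided test against a 50/50 split will do. This sketch uses the normal approximation to the binomial, which is reasonable once you have a few hundred decisive votes; “can’t decide” votes are simply left out, as in a sign test:

```python
from math import erf, sqrt

def split_significance(wins_a: int, wins_b: int) -> float:
    """Two-sided p-value for 'the preference is really 50/50'.

    Uses the normal approximation to the binomial. Ties ("can't decide"
    votes) should be excluded before calling this.
    """
    n = wins_a + wins_b
    z = (wins_a - n / 2) / sqrt(n / 4)  # standardized deviation from 50%
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided tail

# A 75/25 split on 100 votes is overwhelming evidence...
print(split_significance(75, 25))   # ~ 5.7e-07
# ...but 48/52 on the same 100 votes is pure noise.
print(split_significance(48, 52))   # ~ 0.69
```

The same arithmetic explains the 100K figure above: the smaller the difference you care about, the more samples it takes to separate it from noise.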
Will Search for Food: If you can’t get a thousand searches, run it as a qualitative test. Set up a table in the lunch room, hand out cookies, and ask people to run a few pre-determined searches plus the last few things they personally looked for on your intranet. Talk to them about why they like one or the other. Ask them what they expected to find. Have an observer take notes, lots of notes. You should still shoot for more than 50 users. 200 would be better. Could be a long week in the lunch room.
Cute Overload: Allow for “can’t decide”. Sometimes, both kittens are equally cute.
Let’s take a break and recommend a couple of slightly more serious posts on A/B testing:
Fundamentally, kitten war testing gets close to the truth: which engine makes your users happiest. You might argue for the shortest task completion time, but happy users are a very fine thing.
Kitten war testing is not usability testing. If you are trying to improve usability, do real usability testing. That isn’t really harder than kitten war testing, but it is a different kind of test. For a quick intro, see Jakob Nielsen on usability testing with five users.