At the Orthodontist

I spent an hour plus sitting at the orthodontist one morning last week while my son got started on his second round of braces. I was wearing my Netflix sweatshirt, so I chatted with the assistant about movies, search, Don, and streaming. I pointed out the Netflix support in the new LG Blu-ray player, and the kid in the neighboring chair said, “and the Roku box, we watch a lot of stuff on that”.

I love being in Silicon Valley. Even the middle-schoolers are on top of the tech trends.

Search Evaluation by Kitten War

On a search engine mailing list, the topic of simple A/B testing between search engines came up. This can be between different implementations, different tunings, or different UI presentations. The key thing is that users are offered two alternatives and asked which one they like better. One bit of information, this one or that one. If you’ve been to the Kitten War site, you’ll understand why I call it “kitten war testing”. Others may call it a “beauty contest”. They are wrong, of course.

During the years I worked on Ultraseek, surprisingly few customers had the spare time to run serious tests. One national laboratory ran tests as part of their evaluation and later ran larger tests on their intranet design. Another ran regular tests on all changes to their intranet search, presentation or ranking. These were the exception. We had at least 10,000 customers over nine years and only a handful ran serious tests.

Where I work now, we have a few million queries per day, so we can easily put a few tens of thousands of users into test and control cells. We do that for all changes, religiously. Most people don’t have that luxury, but you can run a kitten war test and rise above the superstitious masses on a wave of real data.

Kitten war testing can be very effective, but it is very, very easy to mess it up. Here are some things to watch out for.

Cuteness Counts: Just like with kittens, the prettiest search results page will almost always win. If there are two engines, they must be configured to look as similar as possible. Really. It is OK if they are both ugly, just make them the same. Double-blind is even better, but at a minimum the cuteness judge must not know which one is which.

Longhairs are Cuter: Watch out for visible differences which are not essential to the engines. The length of summaries is one of those. On the other hand, some differences may be intrinsic to the different engines. For a while, I have felt that Yahoo’s snippets are slightly better than Google’s. Snippets are really hard and a reasonable part of a kitten war test.

Google is Cuter: Or, “brand names work”. One of our Ultraseek customers did a blind kitten war test between Ultraseek and Google. Ultraseek was preferred 75% of the time. Some executive found this hard to believe and asked that they try it again with the logos attached to the pages. The second time, people preferred Google over half the time.

5% Doesn’t Matter: Unless you can get lots of data points, you’ll have very noisy results. 75% is a strong result, but a 48/52 split is not good enough. We run on-line tests with less than 1% difference, but it takes about 100K samples for those numbers to settle down. Find someone who got A’s in statistics and ask them to help.

Will Search for Food: If you can’t get a thousand searches, run it as a qualitative test. Set up a table in the lunch room, hand out cookies, and ask people to run a few pre-determined searches and the last few things they personally looked for on your intranet. Talk to them about why they like one or the other. Ask them what they expected to find. Have an observer take notes, lots of notes. You should still shoot for more than 50 users. 200 would be better. Could be a long week in the lunch room.

Cute Overload: Allow for “can’t decide”. Sometimes, both kittens are equally cute.
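To make the "5% Doesn’t Matter" point concrete, here is a minimal sketch of how you might check whether a kitten war split is distinguishable from a coin flip. It is not how we or anyone else necessarily runs the numbers. It assumes each vote is independent, drops "can’t decide" votes before testing, and uses a normal approximation to the binomial. The function names `preference_z_test` and `votes_needed` are made up for this example, and the sample-size estimate ignores statistical power, so treat it as a lower bound and still go find your statistics friend.

```python
import math


def preference_z_test(wins_a, wins_b):
    """Two-sided z-test: does the A-vs-B split differ from 50/50?

    Pass only decided votes; "can't decide" votes are left out.
    Returns (z, p_value) using a normal approximation.
    """
    n = wins_a + wins_b
    if n == 0:
        raise ValueError("need at least one decided vote")
    p_hat = wins_a / n
    se = math.sqrt(0.25 / n)  # standard error if the true split is 50/50
    z = (p_hat - 0.5) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


def votes_needed(split, z_crit=1.96):
    """Rough count of decided votes before a true `split` (e.g. 0.52)
    sits z_crit standard errors away from 0.5. Ignores power."""
    delta = abs(split - 0.5)
    if delta == 0:
        raise ValueError("a 50/50 split can never be told apart from 50/50")
    return math.ceil((z_crit * 0.5 / delta) ** 2)


if __name__ == "__main__":
    for a, b in [(75, 25), (52, 48)]:
        z, p = preference_z_test(a, b)
        print(f"{a}/{b}: z = {z:.2f}, p = {p:.3g}")
    print("decided votes needed for a 52/48 split:", votes_needed(0.52))
```

With 100 decided votes, a 75/25 split is far outside what chance would produce, while 52/48 is well within the noise. Under these assumptions a 52/48 preference needs a couple of thousand decided votes before it starts to stand out, and smaller gaps need many more.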

Let’s take a break and recommend a couple of slightly more serious posts on A/B testing:

Fundamentally, Kitten War testing gets close to the truth — which engine makes your users happiest. You might argue for the shortest task completion time, but happy users are a very fine thing.

Kitten war testing is not usability testing. If you are trying to improve usability, do real usability testing. That isn’t really harder than kitten war testing, but it is a different kind of test. For a quick intro, see Jakob Nielsen on usability testing with five users.

Two Stories About Marriage

I agree with Plain Jane Mom: this first story, “She’s happily married, dreaming of divorce,” is about the most depressing thing I’ve ever read about a “good” marriage.

There are so many things wrong about this. Leaving your shoes in the way isn’t even being a good roommate, let alone a good husband. Writing your complaints in O: The Oprah Magazine instead of going to counseling is a cheat. This isn’t source material, it is your marriage.

Some of it hardly sounds real. Does she really believe that every wife thinks of divorce as a security blanket, that it is “the closely held contemplation of nearly every woman I know who has children who have been out of diapers for at least two years and a husband who won’t be in them for another 30.”?

Of course married people think seriously about divorce. As Ambrose Bierce said, “Who never doubted, never half believed.” But to treasure it? To call it a “secret reverie”? Dear Abby would tell you to get to a marriage counselor. Get some coaching in being human to each other, talking, living. It works, we’ve done it.

After that has thoroughly bummed you out, or perhaps, after you skip it, read John Scalzi on losing wedding rings and his tenth anniversary. It is full of the shared life, secret jokes, and surprises that only happen when you live together and love each other for that long.