Destructive Spidering

Reading The Spider of Doom on The Daily WTF reminded me of a similar story with Ultraseek from years ago, though ours had a happier ending.

Back in 1998 or 1999, we got a call from a customer asking if it was possible that the spider could delete pages from Lotus Domino. 10,000 pages had disappeared overnight, and the only thing they could think of was the evaluation copy of Ultraseek. After looking at the access logs, we figured out that they had a link to “delete this page” on every page. Also, they’d logged the spider in as Admin so that it could access everything. Oops!

I said there was a happy ending? They restored everything from backups, figured out that a link (GET) was a bad idea and changed it to a button (POST), and they bought Ultraseek because they knew that it could access every single page. On our end, we added a rule to never, ever follow that link on Lotus Domino. We all solved our problems and learned something, too.
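
For anyone who has not been burned by this before, the fix is worth spelling out. Below is a minimal sketch of that pattern using Python's standard http.server; the /delete path and the tiny in-memory page store are invented for illustration, not their actual code. A crawler following links only issues GETs, so a delete that answers only to POST is out of its reach.

```python
# Minimal sketch (hypothetical /delete path and page store) of the fix:
# destructive actions answer only to POST, so a link-following spider
# that issues GETs can never trigger them.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

PAGES = {"home": "Welcome", "faq": "Questions"}   # stand-in for the real page store

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # GET never changes anything; spiders can follow every link safely.
        if urlparse(self.path).path == "/delete":
            self.send_response(405)               # Method Not Allowed
            self.send_header("Allow", "POST")
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write("\n".join(PAGES).encode())

    def do_POST(self):
        # The "delete this page" button submits a form here; only POST mutates state.
        parts = urlparse(self.path)
        if parts.path == "/delete":
            page = parse_qs(parts.query).get("page", [None])[0]
            PAGES.pop(page, None)
        self.send_response(303)                   # redirect back after the action
        self.send_header("Location", "/")
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Handler).serve_forever()
```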

My Search History

I decided to do some quick data gathering with a Google account. For a couple of weeks, I did all my searches on Google (I usually switch between several engines), then transcribed the results. These are posted here, but with the names of a couple of personal acquaintances redacted. They don’t need even the tiny bit of fame or spam they’d get from this blog.

With a (small) set of personal queries, it should be possible to think about approaches for search result personalization. But looking at these, I don’t see any way to improve search with one person’s query stream. I can figure out the context, but there just isn’t enough data for an algorithm to get its teeth into. Also, there are so many different information needs here that any personalization would always be about four steps behind. Several of these are “gold star queries” — one query, get the info.

Want to guess some of the information needs? For extra credit, can you spot the query tactics used?

  • enterprise search
  • ultraseek
  • enterprise search blogs
  • macintosh cad house
  • coveo, sharepoint
  • sketchup
  • google terms of use
  • google terms and conditions
  • how long does cologne last
  • perfume storage
  • parfum storage
  • jean shepherd
  • proper name of friend from Rice
  • 3.15mm
  • 3.15mm lead
  • tarptent
  • yosemite webcap
  • yosemite webcam
  • yosemite conditiona
  • yosemite conditions
  • yahoo term extraction service
  • proper name of friend from HP
  • proper name of friend from HP (alternate spelling)
  • hp narrative history
  • bike helmet cover ladybug
  • Louis Menand, Tetlock
  • verio signature backroom
  • palo alto aspca
  • palo alto spca
  • pets in need
  • typekey
  • lake huntington weather
  • art in the age of mechanical reproduction
  • buck sharpening guide
  • buck honemaster
  • david rains wallace
  • adam gopnik
  • frank westheimer
  • eagle required
  • english entropy, shannon
  • glass cockpit
  • ohlone wilderness map

Google Linux Distro: Desktop or Appliance?

This news article on Google Ubuntu, aka Goobuntu, only talks about desktop Linux, toolbars, clients, and challenging Microsoft. No mention of a base for running Google search without an appliance, or web caching, or GFS-based fault-tolerant file servers, or any of that other server-room stuff.

Funny how people only think about server-side stuff inside Google (“they’re building their own internet!”) and client-side stuff outside (“they want to be Microsoft!”).

I wonder what is really up.

Wikis Considered Harmful to Robots

Most web sites fail to use robots.txt or the robots meta tag, but this rarely causes problems beyond a few junk pages in the index. This is not true for wiki sites. Wikis have tons of extra pages for “edit this” or “show changes”. Those are not the primary content, but a web robot doesn’t know that. The page needs to tell the robot “don’t index me” or “don’t follow my links”. That is what the robots meta tag is for.

I ran into this last week on our intranet. Our spider (Ultraseek) was visiting over 1000 URLs for every page of real content on a wiki!

One of our internal groups put up a copy of MediaWiki. This can show every revision of a page and diffs between each of those revisions. All of these are links, so a spider will follow them and find itself with a combinatorial explosion of pages that don’t really belong in the index. This can get really bad, really fast. The Ultraseek spider sent me e-mail when it hit 300,000 URLs on the wiki. After investigating and putting in some spider URL filters, the site is down to about 300 URLs and about 150 pages of content (it is normal to have more URLs than docs, up to about 2:1). There might have been more than a 1000X explosion of URLs — the spider was still finding new ones when I put in the filters.
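
For the curious, the filters amount to a rule like the sketch below. This is Python, not Ultraseek's own filter syntax, and the wiki hostname is made up, but the MediaWiki query parameters (action, diff, oldid) are the real ones to screen out before a URL ever reaches the fetch queue.

```python
# Rough sketch of a spider-side URL filter for MediaWiki action pages.
# Not Ultraseek's filter syntax; just the idea expressed in Python.
from urllib.parse import urlparse, parse_qs

# Query parameters and actions that mark history views, diffs, and old revisions.
SKIP_PARAMS = {"diff", "oldid"}
SKIP_ACTIONS = {"history", "edit", "raw"}

def should_crawl(url: str) -> bool:
    """Return False for MediaWiki URLs that are navigation, not content."""
    query = parse_qs(urlparse(url).query)
    if SKIP_PARAMS & query.keys():
        return False
    if any(a in SKIP_ACTIONS for a in query.get("action", [])):
        return False
    return True

# Example: the content page passes, the history and diff pages do not.
print(should_crawl("http://wiki.example.com/index.php?title=Names"))                   # True
print(should_crawl("http://wiki.example.com/index.php?title=Names&action=history"))    # False
print(should_crawl("http://wiki.example.com/index.php?title=Names&diff=42&oldid=41"))  # False
```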

To get a feeling for the number of URLs, look at the Names page on MediaWiki, then look at all the URLs on the history tab for that page. Yikes!

At a minimum, there should be a robots meta tag with "noindex, nofollow" on the history tab and on all of the old versions and diffs. That would result in the spider visiting one extra page, the history tab, but the madness would stop right there. A spider can deal with one junk page for each content page, but the thousand-to-one ratio I saw on our internal wiki is bad for everybody. Imagine the wasted network traffic and server load to make all the diffs.
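
To make the effect concrete, here is a sketch of what a well-behaved robot does with that tag. This is plain Python with the standard library, not Ultraseek's actual code: noindex keeps the page out of the index, nofollow keeps its links out of the crawl queue.

```python
# Sketch of how a polite robot honors <meta name="robots" content="noindex,nofollow">.
# Not Ultraseek's actual code; just the standard-library version of the idea.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",")}

def crawl_decision(html: str):
    """Return (index_it, follow_links) for a fetched page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    index_it = "noindex" not in parser.directives
    follow_links = "nofollow" not in parser.directives
    return index_it, follow_links

# A history page marked this way gets fetched once, and the madness stops there.
history_page = '<html><head><meta name="robots" content="noindex,nofollow"></head>...'
print(crawl_decision(history_page))   # (False, False): do not index, do not follow
```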

Some people would suggest using POST instead of GET for these pages. That would be wrong. A request to a history resource (URL) does not affect the resource (a POST); it gets the information (a GET). The response should be cacheable, for instance. But please mark it "noindex, nofollow".

I don’t mean to pick on MediaWiki; it is just a specific example of a widespread problem. MediaWiki actually gets extra points for sending a correct last-modified header on the content pages. It is sad that something as fundamental as a correct HTTP header gets extra points, but that is what the web looks like from a robot’s point of view.

WikiPeople everywhere, use the robots meta tag! Robots will thank you, and they will stop beating the living daylights out of your servers. While you are at it, make sure that last-modified and if-modified-since work, too.
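
The last-modified check is just a conditional GET. Here is what it looks like from the robot's side, sketched with Python's urllib; the URL is a placeholder. A server that answers 304 Not Modified saves the transfer, the parsing, and the re-indexing.

```python
# Sketch of a conditional GET from the robot's side; the URL is a placeholder.
import urllib.request, urllib.error

URL = "http://wiki.example.com/index.php?title=Names"

# First fetch: note the Last-Modified date the server reports.
with urllib.request.urlopen(URL) as resp:
    last_modified = resp.headers.get("Last-Modified")

# Later re-fetch: hand that date back as If-Modified-Since.
if last_modified:
    req = urllib.request.Request(URL, headers={"If-Modified-Since": last_modified})
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()        # 200: the page changed, re-index it
    except urllib.error.HTTPError as err:
        if err.code == 304:
            body = None               # 304 Not Modified: nothing to fetch or re-index
        else:
            raise
```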

Recommendation System Spam is Not New

John Battelle comments on gaming a recommendation system as a possible explanation for the embarrassing “similar items” result at walmart.com:

We all know about Google Bombing. But Recommendation System Bombing? That’s a new one to me!

Not new at all. We saw this at Infoseek in roughly 1998. As I remember, a vendor came to us with a system and we showed them how easy it was to spam. I’m sure that Firefly knew about it before then. Any kind of adaptive system (ranking, recommendation) is vulnerable. That is probably why personalization keeps reappearing on WWW engines: the login makes it much harder to spam.

Think like a spammer — if the system reacts to you, you can probably hack it.

Two Predictions for Enterprise Search

David Heinemeier Hansson and John Battelle have predictions pointing in opposite directions for enterprise search.

David’s prediction covers all enterprise software:

Enterprise will follow legacy to become a common insult among software creators and users.

John’s prediction is specific to search:

Enterprise search will show us a few new approaches to consumer search, and vice versa. In fact, we may get to the point where the two are often indistinguishable.

I disagree with both of them. Enterprise search and WWW search are more different than they seem. Yes, there is a bit of cross-fertilization, but each one has critical problems that just don’t exist for the other. WWW search engines must fight spam, sell ads, and deal with insane scale. For enterprise search, access to repositories, security, and ease of admin are essential.

Will WWW search and enterprise search be indistinguishable? Yes, for end users. Each one already has a box, a button, and a result list. Underneath, they are quite different, like two antibiotics designed for different kinds of bacteria. They may look the same, but you’d better pick the one designed for your problem.