Most web sites fail to use robots.txt or the robots meta tag, but this rarely causes problems beyond a few junk pages in the index. This is not true for wiki sites. Wikis have tons of extra pages for “edit this” or “show changes”. Those are not the primary content, but a web robot doesn’t know that. The page needs to tell the robot “don’t index me” or “don’t follow my links”. That is what the robots meta tag is for.
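Here is a minimal sketch of what "telling the robot" could look like on the server side. The view names ("content", "edit", "changes") and both function names are made up for illustration and do not come from any particular wiki engine. Only the meta tag itself, with its "noindex" and "nofollow" values, is the standard part.

```python
# Hypothetical view names for the auxiliary pages a wiki generates.
AUXILIARY_VIEWS = {"edit", "changes"}


def robots_meta(index: bool, follow: bool) -> str:
    """Build a robots meta tag from the two decisions a page can make."""
    index_part = "index" if index else "noindex"
    follow_part = "follow" if follow else "nofollow"
    return f'<meta name="robots" content="{index_part}, {follow_part}">'


def robots_meta_for_view(view: str) -> str:
    """Primary content is indexable; auxiliary views are not (an assumed policy)."""
    wanted = view not in AUXILIARY_VIEWS
    return robots_meta(index=wanted, follow=wanted)


print(robots_meta_for_view("edit"))
```

The tag belongs in the head of each generated page, so the page template is the natural place to call something like this.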
I ran into this last week on our intranet. Our spider (Ultraseek) was visiting over 1000 URLs for every page of real content on a wiki!
One of our internal groups put up a copy of MediaWiki. This can show every revision of a page and diffs between each of those revisions. All of these are links, so a spider will follow all the links and find itself with a combinatorial explosion of pages that don’t really belong in the index. This can get really bad, really fast. The Ultraseek spider sent me e-mail when it hit 300,000 URLs on the wiki. After investigating and putting in some spider URL filters, the site has about 300 URLs and about 150 pages of content (it is normal to have more URLs than docs up to about 2:1). There might have been more than a 1000X explosion of URLs — the spider was still finding new ones when I put in the filters.
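For a rough idea of what a spider-side URL filter does, here is a sketch in Python. This is not Ultraseek's filter syntax. The query parameter names (action, oldid, diff) follow the usual MediaWiki URL style, but treat them as assumptions and check them against the URLs your own wiki produces. The host wiki.example is a placeholder.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed query parameters that mark old versions and diffs.
JUNK_PARAMS = ("oldid", "diff")
# Assumed values of the "action" parameter that mark non-content views.
JUNK_ACTIONS = {"edit", "history"}


def is_content_url(url: str) -> bool:
    """Return True if the spider should fetch and index this URL."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if any(param in query for param in JUNK_PARAMS):
        return False
    return not any(action in JUNK_ACTIONS for action in query.get("action", []))


urls = [
    "http://wiki.example/index.php?title=Names",
    "http://wiki.example/index.php?title=Names&action=history",
    "http://wiki.example/index.php?title=Names&diff=124&oldid=123",
]
print([u for u in urls if is_content_url(u)])
```

A filter like this is a workaround on the spider side. It has to be repeated by every spider operator, which is why the fix belongs in the wiki.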
To get a feeling for the number of URLs, look at the Names page on MediaWiki, then look at all the URLs on the history tab for that page. Yikes!
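A back-of-the-envelope count shows how quickly this grows. The sketch below assumes the worst case: one history tab, one URL per old version, and a diff URL for every pair of revisions. A real wiki may link fewer diffs than that, so read the result as an upper bound. The function name is only for illustration.

```python
from math import comb


def history_url_count(revisions: int) -> int:
    """Upper bound on extra URLs for one page with the given number of revisions."""
    history_tab = 1
    old_versions = revisions
    pairwise_diffs = comb(revisions, 2)  # every pair of revisions, assumed reachable
    return history_tab + old_versions + pairwise_diffs


for n in (10, 45):
    print(n, history_url_count(n))
```

Under these assumptions, a page with 10 revisions yields 56 extra URLs and a page with 45 revisions yields 1036, which is the same order of magnitude as the ratio I saw.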
At a minimum, there should be a robots meta tag with "noindex, nofollow" on the history tab and on all of the old versions and diffs. That would result in the spider visiting one extra page, the history tab, but the madness would stop right there. A spider can deal with one junk page for each content page, but the thousand-to-one ratio I saw on our internal wiki is bad for everybody. Imagine the wasted network traffic and server load to make all the diffs.
Some people would suggest using POST instead of GET for these pages. That would be wrong. A request to a history resource (URL) does not affect the resource (a POST); it gets the information (a GET). The response should be cacheable, for instance. But please mark it with the robots meta tag.
I don’t mean to pick on MediaWiki; that is just a specific example of a widespread problem. MediaWiki actually gets extra points for sending a correct last-modified header on the content pages. It is sad that something as fundamental as a correct HTTP header gets extra points, but that is what the web looks like from a robot’s point of view.
WikiPeople everywhere, use the robots meta tag! Robots will thank you, and they will stop beating the living daylights out of your servers. While you are at it, make sure that last-modified and if-modified-since work, too.
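For the last-modified and if-modified-since part, here is a minimal sketch of the server-side check, using only the Python standard library. The function name and the (status, headers) return shape are assumptions for illustration, not any framework's API. It expects a timezone-aware modification time.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Optional, Tuple


def conditional_get(
    last_modified: datetime, if_modified_since: Optional[str]
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers): 304 if the page is unchanged, else 200."""
    # HTTP dates have one-second resolution, so drop the microseconds.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # ignore a malformed header and send the full page
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers
    return 200, headers


changed = datetime(2022, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_get(changed, "Sat, 01 Jan 2022 12:00:00 GMT"))
```

When the robot already has the current version, the server answers 304 with no body, so neither side pays for regenerating or transferring the page.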