Do all-stopword queries matter?

Many search engines don’t index “stopwords”, words that are very common and have little meaning by themselves. The stopword list is often just the most frequent words in the language: “the”, “be” (and its inflections), “a”, “of”, and so on.

Search engines that index all words like to show off searches for “to be or not to be”, because stopword elimination can remove every word in the phrase. Of course, no one really searches for “to be or not to be” because we all know where it came from.
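Stopword elimination is easy to sketch; here is a minimal illustration in Python, where the stopword set is purely illustrative and not any real engine’s list:

```python
# A minimal sketch of stopword elimination. The STOPWORDS set here is
# illustrative only; real engines typically derive theirs from the most
# frequent words in their corpus.
STOPWORDS = {"the", "be", "a", "an", "of", "to", "or", "not", "is", "it"}

def remove_stopwords(query):
    """Return the query terms that survive stopword elimination."""
    return [t for t in query.lower().split() if t not in STOPWORDS]

print(remove_stopwords("to be or not to be"))  # -> []
print(remove_stopwords("purpose of plasma"))   # -> ['purpose', 'plasma']
```

An engine that eliminates stopwords before lookup is left with nothing at all to search for on the first query.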

Are there any real titles that are all stopwords? Does this matter? I’ve been indexing movie titles, and found more than a few that are 100% stopwords.

The last one isn’t a traditional stopword, but think about the number of “click here” links on the web. It is a web stopword, for sure.

Apple and High Standards

At work, we had an e-mail discussion about this New Yorker article on feature proliferation, and one of the strands was about a misunderstanding in the article, specifically that Apple Design equals Simple Design, especially with the iPod. Among the various good observations — “simple is hard”, “sexy supports simple” — I chipped in with the following.

There is a weird follow-on effect that makes the best Mac software vastly better (or at least better-looking) than supposedly equivalent stuff on Windows or Linux. For example, Microsoft’s Mac e-mail tool (Entourage) is much nicer than Outlook.

Even more odd is the case of NetNewsWire, an RSS reader. It has been around for nearly five years, but no one has even managed to copy it on Windows, let alone surpass it. WTF? How hard can it be? Do Windows developers just not pay attention?

In some sort of cosmic synchrony, Tim Bray posted a one-paragraph shout-out to NNW today. Tim is Sun’s Director of Web Technology, the driving force behind Atom, and a really nice person. Heck, it ain’t that long, I’ll quote the whole thing:

The problem is, these days, that my input queues are jammed up. I’m reading Caesar: Life of a Colossus by Adrian Goldsworthy and it’s very good, but it’s awfully big and thick and dense. And my time for reading is tight because, after all, I’m married with two children and also I’m trying to read the Internet, or at least that huge little piece of it where people care about the things I do. And on that subject, once again I just have to plug NetNewsWire. I’ve tried a ton of newsreaders on a ton of platforms. Google’s blog reader is pretty good, and so are a couple of the other clients, but NetNewsWire just shows you more stuff in less time with fewer keystrokes. Years ago I predicted that feed-reading would have been sucked into the browser by now, but I was wrong. So between that and Caesar, and day-to-day job work, and a grungy unexciting complicated fill-a-hole-in-the-ecosystem programming project, well, I have Wikinomics and Everything is Miscellaneous and RESTful Web Services and the Programming Erlang PDF staring accusingly at me from the shadows. Blame Julius Caesar and Brent Simmons.

NetNewsWire rules. It is vastly better than anything else I’ve tried, including Google Reader.

One of the reasons I use a Mac is to keep my standards high. If my work is “as good as Windows”, it isn’t good enough.

My So-Called Life, A Dozen Years Later

Last week our family watched (on VHS!) the pilot for My So-Called Life, the 1994 TV series. Most of it went over our twelve-year-old son’s head, but I was blown away. Even though I’ve seen a lot of the episodes multiple times on MTV marathons and I knew the acting was excellent, I was just amazed at the screenwriting. It is fluid, natural, wonderfully paced, visual, and the voiceovers even fit. There is a dinner scene where four people have four agendas, and they are all talking past each other. There is a montage of a single school day, with a teacher asking “What is the purpose of plasma?” followed immediately by the answer “Because it is written in the first person” from the next class. Angela gives an honest response to The Diary of Anne Frank, then realizes she has just sounded completely shallow in front of the whole class.

The whole thing should just collapse under the weight of the craft (Citizen Kane just about does that), but it soars.

Why? Because it is true to high school life. It is proof that at least one person grew up and did not forget what it was like and wrote it down.

Selective Page Indexing Directives

If you can control what parts of an HTML page are indexed by a search engine, you can really improve the quality of search results. Unfortunately, there is no standard way to do this, and Yahoo! has just added one more proprietary set of directives.

Some sections of HTML pages are the core content and some are navigation, ads, decoration, or site-related. If a search engine can index just the core content, it will have cleaner data in the search index. I think of this as “gold in, gold out” instead of the more common “garbage in, garbage out”.

Search engines support different, incompatible ways of marking sections which should and should not be indexed. This page includes a list of every selective indexing implementation I’ve found. The original list was compiled when I was designing the Ultraseek Page Expert feature back in 2004.

Yahoo is the only WWW search engine to implement any of these schemes, and they invented their own this month. They claim that they “did a little homework”. I’ll believe the “a little” part since they are reinventing the wheel, and it only took me one day to find and document all these tags. Their customers are already complaining about the “don’t index” sense, which has been a usability problem with many of the other directives over the past ten years.

Do not confuse these directives with the robots meta tag, which provides hints for indexing the entire page. These directives are for sections of a page.

Ultraseek Page Expert Instead of a fixed directive, Page Expert allows you to configure which parts of the existing markup to index. The pages do not need to be changed to include new directives. It comes pre-configured for the MonArch and Hypermail mail archivers and for Javadoc. A visual preview highlights the parts of the page which will be indexed. The page types (sets of filters) can be applied to specific servers or sets of URLs. Page Expert filters can include multiple markup patterns with both index and noindex actions. See the Page Expert info at ultraseek.com for more details. This was introduced in Ultraseek 5.3 (September 2004).

Implemented in: Ultraseek

<noindex></noindex> This is the most widely implemented directive, but it has some problems. It isn’t legal in HTML 4 or XHTML, so documents with this tag will fail validation. The noindex sections need to be entire blocks of structure; that is, you can’t do <noindex><p>a</noindex>b</p>. On the plus side, it is easy to see how the start and end match, and some HTML editors will help match them for you.

Implemented in: Verity/Autonomy K2, Ultraseek, Atomz, FDSE (with customization)
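One way an indexer might handle these sections is simply to strip them before extracting text. This is a hypothetical sketch, not any product’s actual implementation, and it assumes the tags are properly paired, as the directive requires whole blocks of structure:

```python
import re

def strip_noindex(html):
    """Remove <noindex>...</noindex> sections before indexing.

    Sketch only: assumes the tags are correctly paired and does not
    attempt real HTML parsing.
    """
    return re.sub(r"<noindex>.*?</noindex>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)

page = "<p>article body</p><noindex><ul><li>nav links</li></ul></noindex>"
print(strip_noindex(page))  # -> '<p>article body</p>'
```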

<!--stopindex--><!--startindex--> Legal in HTML or XHTML, but the sense of the directives confuses some users: people seem to expect to start and stop the noindex section, not the index section. One advantage is that these do not need to match or nest, so there can be multiple stopindex directives in different templates or SSIs, and indexing will still start at a startindex directive. These were proposed at Infoseek and implemented in [Ultraseek Server](http://www.ultraseek.com/support/faqs/1001.html) in 1997.

Implemented in: Verity Ultraseek
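The no-nesting behavior is easy to model as a simple on/off switch while scanning the page. This is a hypothetical sketch of the idea, not Ultraseek’s actual code; note that repeated stopindex directives are harmless because there is no nesting to balance:

```python
import re

def indexable_text(html):
    """Return only the text that an indexer honoring stopindex/startindex
    comments would keep. Sketch: the directives toggle a flag, so
    repeated or unmatched directives cause no errors."""
    parts = re.split(r"<!--\s*(stopindex|startindex)\s*-->", html)
    indexing = True
    kept = []
    for part in parts:
        if part == "stopindex":
            indexing = False
        elif part == "startindex":
            indexing = True
        elif indexing:
            kept.append(part)
    return "".join(kept)

page = "intro <!--stopindex--> nav <!--stopindex--> ads <!--startindex--> article"
print(indexable_text(page))  # -> 'intro  article'
```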

<!--googleoff: all--><!--googleon: all--> A more complicated version of the stopindex structured comments, accepted only by the Google Search Appliance. Instead of all, you may use anchor, snippet, or index. It isn’t exactly clear what happens when different directives are mixed or repeated, though some people think that googleon: all will enable all of the attributes. These are documented in Google’s appliance docs, which are not publicly available. This description of googleon/googleoff matches what I’ve learned about them. These directives are ignored by Google’s WWW search engine.

Implemented in: Google Search Appliance

<p class="robots-nocontent"> The robots-nocontent class, introduced by Yahoo! for WWW search in May 2007, can be applied to any HTML element that allows the class attribute. This is the only selective indexing directive I know of for any WWW search engine. Like the stopindex and googleoff directives above, the inverted sense of this directive seems to confuse many users.

Implemented in: Yahoo! Web Search
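Because this one rides on the class attribute, honoring it takes a real HTML parse rather than comment scanning. Here is a hypothetical sketch using Python’s standard-library parser, assuming well-formed markup (every start tag has a matching end tag):

```python
from html.parser import HTMLParser

class NocontentFilter(HTMLParser):
    """Collect only the text outside elements marked robots-nocontent.

    Sketch: once inside a marked element, a depth counter tracks
    nesting so the whole subtree is skipped. Assumes well-formed
    markup with balanced tags.
    """
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # nesting depth inside skipped elements
        self.kept = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or "robots-nocontent" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.kept.append(data)

f = NocontentFilter()
f.feed('<p>article text</p><p class="robots-nocontent">ad <b>copy</b></p>')
print("".join(f.kept))  # -> 'article text'
```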

<alkaline skip></alkaline> The Alkaline search engine has a product-specific tag; the skip option causes the contained content to be skipped by the indexer. This has the same disadvantages as <noindex>.

Implemented in: Alkaline

<!-- robots content="noindex" --><!-- /robots --> This structured comment borrows the ROBOTS meta tag format. It isn’t clear what happens if the start and end directives are not matched. Is that an error? Does it work like the <!--stopindex--> directives? This is used in two Perl-based search engines.

Implemented in: Fluid Dynamics Search Engine (FDSE), Darryl Burgdorf’s WebSearch

<!-- robots:noindex --><!-- /robots:noindex --> Proposed by Avi Rappoport of searchtools.com. This uses an XML namespace style in a structured comment. Not implemented by any search engines, as far as I know.

Please let me know of any other tags like this.