Articles on ultraseek.com

I write occasional articles on search and Ultraseek for ultraseek.com, so I’ve collected links to them all from here. I’ve clipped the first paragraph or two of each one so you can see whether you’d like to read the entire article.

Relevance and User Satisfaction

Search relevance is usually thought of as a statistic that measures whether the search results match the query. That is useful in the lab, but not as useful for a search installation.

When search is part of a site, we need to understand how it helps the users of that site. Can they find things quickly? Are they comfortable with the search?

My Favorite Customer Problem

We are concerned about all of the problems reported by our customers, but there is one problem I don’t mind hearing about.

Keeping Your Index Fresh with Add URL

Everyone wants their search index to be an accurate, timely reflection of their content. Ultraseek automatically revisits pages to find new URLs, and that is very effective, but some sites have even stronger requirements for how quickly documents need to be available in search results. This is called “index freshness.” A stale index frequently misses new pages and has old information, including pages that have already been deleted and old copies of pages that have changed since they were indexed. For maximum index freshness, use Ultraseek’s Add URL feature for notifications of deleted, changed, or new URLs.

Don’t Reindex Every Week!

If you have used other search engines, you probably had to manually configure your indexing schedule to make sure new content was found and indexed. This is not necessary with Ultraseek.

Ultraseek has “continuous spidering with real-time indexing and adaptive revisit intervals.” It sounds complicated, but it means that Ultraseek will automatically spider most pages at the right times.

Why Not Use “All Terms” Queries?

Google, Yahoo!, and MSN all default to matching all of your search terms, but Ultraseek does not. Why? What do you say when your users want Ultraseek to “work like Google”?

In most cases, it is good for an enterprise engine to behave like the WWW engines because users can intuitively transfer their searching skills. But, this is a case where doing the right thing is more expensive for the WWW engines, and more reasonable for enterprises.

It looks like I’ve been careful to write useful leads, because selecting the first few sentences makes a pretty fair intro to each article.

A Different Approach to On-line Text

Maybe this is more amusing to people who have worked on search indexes, but I thought it was a worthwhile use of computer resources. Check out Starship Titanic: The Novel! Click through all the intro pages; that is part of the fun. One of the index pages has a dead link, but there is remarkably little linkrot for something put on the web in 1997.

Don’t miss the colophon and contest page.

Query Box as Extensible UI

Yahoo’s Open Shortcuts is a nice simple extension to search implemented entirely within the query box. We’ve been able to do queries like wikipedia emma bull or zip code palo alto or weather baton rouge for a while, using the first word as a context command. Some of those (“wikipedia”) bias the results, while others display custom information above the results.

Open Shortcuts adds the ability to punch through to a different search engine (like !ebay dagor) and also to define your own contexts that go to your favorite spots.

It is easy to define a shortcut for any Ultraseek search engine. Here’s how to define a shortcut for ultraseek.com.

  1. Go to Create a Shortcut.
  2. Scroll down to the Search a Site section.
  3. Enter a shortcut name, like ultra.
  4. Enter this for the URL: http://search.ultraseek.com/query.html?qt=%s
  5. Click Set, then OK
  6. Try a search, like !ultra all terms and enjoy results direct from the Ultraseek site.

You can use this pattern for most Ultraseek-powered sites. Do a search to find the host with the search engine, then use that host with /query.html?qt=%s. Some sites may have customized pages with a different path. Check your browser’s location bar to make sure.
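The %s in the shortcut URL is a placeholder that gets filled with your URL-encoded query. A minimal sketch of that substitution in Python, using the URL from the steps above:

```python
from urllib.parse import quote_plus

def shortcut_url(template: str, query: str) -> str:
    """Fill a %s-style shortcut template with a URL-encoded query."""
    return template.replace("%s", quote_plus(query))

url = shortcut_url("http://search.ultraseek.com/query.html?qt=%s", "all terms")
print(url)  # http://search.ultraseek.com/query.html?qt=all+terms
```

The same function works for any site whose search page takes the query in a single parameter; only the template changes.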

Nice work, Yahoo, and much simpler than A9’s OpenSearch.

Older than FORTRAN

But only by a few months. Today is my 50th birthday, and the most reliable “birthday” I can find for FORTRAN is October 15, 1956, the publication date for the FORTRAN Programmer’s Reference Manual (scanned PDF).

I wrote my first program in FORTRAN. To be specific, FORTRAN IV EMU from Eastern Michigan University, running on the IBM 1401 (I think) at Rose-Hulman Institute of Technology. I was at Operation Catapult, a three-week program for high school juniors. Big fun, and I’m glad to see it is still running.

The program was a two-body simulation, with the paths printed in line-printer graphics. I wonder if I still have a copy of that somewhere in the “closed stacks” at the back of the garage.

FORTRAN wasn’t my first computer language; that was BNF grammars. I was reading SF in math class because I was being taught logarithms for the third time, and I’d learned them before I was taught them the first time (got a slide rule for Christmas in seventh grade). The teacher noticed and had me stay after to chat. He sympathized, but asked me to at least read a math book during class. So, I found one on computer programming and churned through it over a couple of weeks. I still have a fondness for colon-equals as an assignment op.

PowerBook out the Window

No, I didn’t throw it and I’m not switching. Someone broke our window at 3:30 AM and grabbed my PowerBook off the table. Gone.

A window breaking is really loud. We thought that the kittens had managed to knock down a stack of cookie sheets with dishes on top of it until we found the broken glass by the table. The Palo Alto police were really nice, but it was hard to get back to sleep. The kids slept through the whole thing, of course. And all this two days before we left on vacation.

The IT department has been really great — my new IntelBook is already delivered, waiting for me to return from Maui.

I miss the data more than the hardware. I wasn’t very good about backups, but I did treat most of the laptop data as volatile. E-mail lives on the server and I’m religious about the digital photos being on two separate storage devices before I delete them from the camera. Code is all in CVS. Software keys are copied to the home iMac. Still, there are plenty of miscellaneous things that are just gone, like notes from the Patrol Leaders Council (time to trust the Troop Scribe to take notes).

Since I’m starting clean on the new machine, I’m open to recommendations for Mac software (especially backup).

Destructive Spidering

Reading The Spider of Doom on The Daily WTF reminded me of a similar story with Ultraseek from years ago, though ours had a happier ending.

Back in 1998 or 1999, we got a call from a customer asking if it was possible that the spider could delete pages from Lotus Domino. 10,000 pages had disappeared overnight, and the only thing they could think of was the evaluation copy of Ultraseek. After looking at the access logs, we figured out that they had a link to “delete this page” on every page. Also, they’d logged the spider in as Admin so that it could access everything. Oops!

I said there was a happy ending? They restored everything from backups, figured out that a link (GET) was a bad idea and changed it to a button (POST), and they bought Ultraseek because they knew that it could access every single page. On our end, we added a rule to never, ever follow that link on Lotus Domino. We all solved our problems and learned something, too.
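The rule we added on our end amounts to a URL exclusion filter in the spider. A minimal sketch in Python; `?DeleteDocument` is a real Domino URL command, but the exact pattern our rule matched is my reconstruction, not the shipped code:

```python
import re

# Destructive Domino URL command; a spider must never fetch these.
# The real Ultraseek rule may have matched more than this one pattern.
EXCLUDE_PATTERNS = [re.compile(r"\?DeleteDocument\b", re.IGNORECASE)]

def should_follow(url: str) -> bool:
    """Return False for URLs a spider must never fetch."""
    return not any(p.search(url) for p in EXCLUDE_PATTERNS)

print(should_follow("http://host/db.nsf/page?OpenDocument"))    # True
print(should_follow("http://host/db.nsf/page?DeleteDocument"))  # False
```

Of course, the filter is a backstop. The real fix was theirs: destructive actions belong behind a POST, not a link.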

My Search History

I decided to do some quick data gathering with a Google account. For a couple of weeks, I did all my searches on Google (I usually switch between several engines), then transcribed the results. These are posted here, but with the names of a couple of personal acquaintances redacted. They don’t need even the tiny bit of fame or spam they’d get from this blog.

With a (small) set of personal queries, it should be possible to think about approaches for search result personalization. But looking at these, I don’t see any way to improve search with one person’s query stream. I can figure out the context, but there just isn’t enough data for an algorithm to get its teeth into. Also, there are so many different information needs here that any personalization would always be about four steps behind. Several of these are “gold star queries” — one query, get the info.

Want to guess some of the information needs? For extra credit, can you spot the query tactics used?

  • enterprise search
  • ultraseek
  • enterprise search blogs
  • macintosh cad house
  • coveo, sharepoint
  • sketchup
  • google terms of use
  • google terms and conditions
  • how long does cologne last
  • perfume storage
  • parfum storage
  • jean shepherd
  • proper name of friend from Rice
  • 3.15mm
  • 3.15mm lead
  • tarptent
  • yosemite webcap
  • yosemite webcam
  • yosemite conditiona
  • yosemite conditions
  • yahoo term extraction service
  • proper name of friend from HP
  • proper name of friend from HP (alternate spelling)
  • hp narrative history
  • bike helmet cover ladybug
  • Louis Menand, Tetlock
  • verio signature backroom
  • palo alto aspca
  • palo alto spca
  • pets in need
  • typekey
  • lake huntington weather
  • art in the age of mechanical reproduction
  • buck sharpening guide
  • buck honemaster
  • david rains wallace
  • adam gopnik
  • frank westheimer
  • eagle required
  • english entropy, shannon
  • glass cockpit
  • ohlone wilderness map

Copying Minor White

Karen Schneider writes about her experiences imitating the style of four different writers.

This reminds me of my photography class at Rice. I took a photo of a pile of dirty sand under a freeway, and when I printed it, it looked a lot like a Minor White. My prof, Peter Brown, saw the same thing in the print. He explained to the class that we should copy as much as we wanted while in school, because that didn’t work after you graduated.

My faux Minor White was an unconscious copy, but working in someone else’s personal style sounds like a really valuable exercise. Maybe I should try Minor White’s style on purpose. He’s not my favorite, but I bet it isn’t as easy as it looks.

I won’t try to copy Peter Brown, because my style already leans in that direction. The closest I can find to a web page for Peter is the announcement that Peter Brown and Kent Haruf won the 2005 Lange-Taylor prize. Check out the photos there, then go get one of his books, maybe On the Plains.

Google Linux Distro: Desktop or Appliance?

This news article on Google Ubuntu, aka Goobuntu, only talks about desktop Linux, toolbars, clients, and challenging Microsoft. No mention of a base for running Google search without an appliance or web caching or GFS-based fault-tolerant file servers or any of that other server-room stuff.

Funny how people only think about server-side stuff inside Google (“they’re building their own internet!”) and client-side stuff outside (“they want to be Microsoft!”).

I wonder what is really up.

Wikis Considered Harmful to Robots

Most web sites fail to use robots.txt or the robots meta tag, but this rarely causes problems beyond a few junk pages in the index. This is not true for wiki sites. Wikis have tons of extra pages for “edit this” or “show changes”. Those are not the primary content, but a web robot doesn’t know that. The page needs to tell the robot “don’t index me” or “don’t follow my links”. That is what the robots meta tag is for.

I ran into this last week on our intranet. Our spider (Ultraseek) was visiting over 1000 URLs for every page of real content on a wiki!

One of our internal groups put up a copy of MediaWiki. This can show every revision of a page and diffs between each of those revisions. All of these are links, so a spider will follow all the links and find itself with a combinatorial explosion of pages that don’t really belong in the index. This can get really bad, really fast. The Ultraseek spider sent me e-mail when it hit 300,000 URLs on the wiki. After investigating and putting in some spider URL filters, the site has about 300 URLs and about 150 pages of content (it is normal to have more URLs than docs up to about 2:1). There might have been more than a 1000X explosion of URLs — the spider was still finding new ones when I put in the filters.
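The spider URL filters that tamed the site boil down to rejecting URLs with MediaWiki’s history and revision parameters. MediaWiki really does use `action=history`, `action=edit`, `diff=`, and `oldid=` for these pages; the filter below is a Python sketch of the idea, not Ultraseek’s actual filter syntax:

```python
import re

# Query parameters that mark MediaWiki history/diff/old-revision pages.
WIKI_JUNK = re.compile(r"[?&](action=history|action=edit|diff=|oldid=)")

def is_content_url(url: str) -> bool:
    """True for URLs worth indexing, False for revision-history noise."""
    return not WIKI_JUNK.search(url)

print(is_content_url("http://wiki/index.php?title=Names"))                 # True
print(is_content_url("http://wiki/index.php?title=Names&action=history"))  # False
print(is_content_url("http://wiki/index.php?title=Names&oldid=12345"))     # False
```

A handful of patterns like these took the site from 300,000 URLs (and climbing) down to about 300.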

To get a feeling for the number of URLs, look at the Names page on MediaWiki then look at all the URLs on the history tab for that page. Yikes!

At a minimum, there should be a robots meta tag with "noindex, nofollow" on the history tab and on all of the old versions and diffs. That would result in the spider visiting one extra page, the history tab, but the madness would stop right there. A spider can deal with one junk page for each content page, but the thousand-to-one ratio I saw on our internal wiki is bad for everybody. Imagine the wasted network traffic and server load to make all the diffs.
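Here is roughly what a spider does with that tag: parse the page head, find the robots meta element, and split its content attribute into directives. A sketch using Python’s standard html.parser:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            content = a.get("content", "")
            self.directives |= {d.strip().lower() for d in content.split(",")}

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
p = RobotsMetaParser()
p.feed(page)
print(sorted(p.directives))  # ['nofollow', 'noindex']
```

If "noindex" is present the spider keeps the page out of the index; if "nofollow" is present it ignores the page’s links, and the combinatorial explosion never starts.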

Some people would suggest using POST instead of GET for these pages. That would be wrong. A request for a history resource (URL) does not modify the resource, which is what POST is for; it retrieves information, which is exactly what GET is for. The response should be cacheable, for instance. But please mark it "noindex, nofollow".

I don’t mean to pick on MediaWiki; it is just a specific example of a widespread problem. MediaWiki actually gets extra points for sending a correct last-modified header on the content pages. It is sad that something as fundamental as a correct HTTP header gets extra points, but that is what the web looks like from a robot’s point of view.

WikiPeople everywhere, use the robots meta tag! Robots will thank you, and they will stop beating the living daylights out of your servers. While you are at it, make sure that last-modified and if-modified-since work, too.

Problem-Solving Products

Here is a very clear statement about understanding your product:

“Don’t start a business if you can’t explain what pain it solves, for whom, and why your product will eliminate this pain, and how the customer will pay to solve this pain. The other day I went to a presentation of six high tech startups and not one of them had a clear idea for what pain they were proposing to solve.”
— Joel Spolsky, Micro-ISV: From Vision to Reality

I also like the advice from What color is your parachute?, that companies hire a person to solve a problem and they don’t want to get new problems. They buy products exactly the same way.

What problem does your product solve?

What problems does your product create?

Recommendation System Spam is Not New

John Battelle comments on gaming a recommendation system as a possible explanation for the embarrassing “similar items” result at walmart.com:

We all know about Google Bombing. But Recommendation System Bombing? That’s a new one to me!

Not new at all. We saw this at Infoseek in roughly 1998. As I remember, a vendor came to us with a system and we showed them how easy it was to spam. I’m sure that Firefly knew about it before then. Any kind of adaptive system (ranking, recommendation) is vulnerable. That is probably why personalization keeps reappearing on WWW engines, the login makes it much harder to spam.

Think like a spammer — if the system reacts to you, you can probably hack it.

Two Predictions for Enterprise Search

David Heinemeier Hansson and John Battelle have predictions pointing in opposite directions for enterprise search.

David’s prediction covers all enterprise software:

Enterprise will follow legacy to become a common insult among software creators and users.

John’s prediction is specific to search:

Enterprise search will show us a few new approaches to consumer search, and vice versa. In fact, we may get to the point where the two are often indistinguishable.

I disagree with both of them. Enterprise search and WWW search are more different than they seem. Yes, there is a bit of cross-fertilization, but each one has critical problems that just don’t exist for the other. WWW search engines must fight spam, sell ads, and deal with insane scale. For enterprise search, access to repositories, security, and ease of admin are essential.

Will WWW search and enterprise search be indistinguishable? Yes, for end users. Each one already has a box, a button, and a result list. Underneath, they are quite different, like two antibiotics designed for different kinds of bacteria. They may look the same, but you’d better pick the one designed for your problem.

Hello, World

I’m finally entering the public blogosphere after running an internal corporate blog for a couple of years. I work on search and spidering, mostly on the Ultraseek search engine. I’ve been on the web for over ten years and on the Internet for over twenty (my first ARPA e-mail address was before domains!).

Why “most casual observer”? My freshman physics professor at Rice was fond of the phrase “intuitively obvious to the most casual observer.” My friends and I thought that one of those would be a really handy thing to have available in the laboratory, because you could just ask them, get the obvious answer, and skip the experiments.

I don’t expect my observations to eliminate all your experiments, but I hope they will save you some time. As Frank Westheimer at Harvard said, “a few months in the laboratory saves a few hours in the library.”