Dirty Pages: Popularity Ranking

A few years ago I was talking with a friend about using access frequency (popularity) as a factor in ranking web pages. He pointed out that this works well for dead-tree materials, too.

He used to go to the university library and start down a row of bound math journals. He’d pull one out and look at the non-bound edge. If there was a section where the page edges were dirty, that was a paper that lots of people had read. So he would read it. Then go to the next volume. Going purely by popularity led him to top papers in areas that he might not have looked at otherwise.

Interestingly, this only works when the materials are shared, like in a library.
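
The dirty-pages trick amounts to counting accesses across many readers and sorting by the count. Here is a minimal sketch of that idea in Python. The access log, the paper names, and the function name are all made up for illustration; a real ranker would treat popularity as one factor among several, not the whole score.

```python
from collections import Counter

def rank_by_popularity(access_log):
    """Return item ids ordered from most accessed to least accessed."""
    counts = Counter(access_log)
    # most_common() sorts by count, highest first
    return [item for item, _ in counts.most_common()]

# Hypothetical shared log: each entry is one reader opening one paper.
log = ["paper_b", "paper_a", "paper_b", "paper_c", "paper_b", "paper_a"]
ranking = rank_by_popularity(log)
print(ranking)
```

The log has to pool everyone's reads, which is the library point again: one reader's private copy never gets dirty enough to tell you anything.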

A Buddy, The Eleventh Essential

Scoutmaster Minute for Troop 14, March 27, 2007

You’ve probably heard about the Scout who was lost in the woods in North Carolina recently. He did a pretty good job of taking care of himself, getting water, keeping warm, but he could have done better. What are some of the things you can think of?

[Scouts answered with “stay in camp”, “don’t split up the troop leaving one adult and one Scout in camp”, “take food”.]

Those are good ideas. He did have some Pringles, which helped — those have a lot of calories.

Here is one thing that would have helped [pull a whistle out of my pocket], a whistle, one of the Ten Essentials[1]. A whistle makes you much easier to find. That would have helped the 200 people who spent four days looking for him.

There is another essential thing, and that is a buddy. We all learn the buddy system for Tenderfoot, because it is so important. He should have had a buddy when he left camp. When you have a buddy, you can make better decisions.

But the buddy system failed at an earlier point. He left camp to go home because his friends weren’t on the trip. He needed a buddy before he left the parking lot. He needed a buddy at the beginning of the trip.

So make sure that you have a buddy. Invite one on every trip. A buddy is the eleventh essential.[2]


  1. Yes, I know that the BSA calls them the “Outdoor Essentials”. I usually do, too, but that would spoil the punch line.
  2. The previous Scoutmaster Minute I posted was also about buddies, but there have been other topics, including an excellent one on focus by one of our ASMs. On the other hand, we’re working on increasing attendance on outings, so I may continue to touch on buddies, friends, and word-of-mouth.

Back to the 70’s

I’ve been listening to music from the 1970’s the past week, though it might not be the same as your 70’s music.

Tracy Nelson, Homemade Songs (1978): My favorite tracks are “The Summer of the Silver Comet”, a love song about a locomotive, and “Friends of a Kind”, a hurtin’ song for grown-ups. If you haven’t heard Tracy sing the blues, this is a good place to start. I like Doin’ it my Way (1978) a little better because it’s a smaller production and because “Time is on my Side” and “Down So Low” are so fine, but that is a vinyl release that is not on CD. Maybe you should check eBay (wow, two copies!) and dust off that turntable.

Pere Ubu, The Modern Dance (1978): You’ll either like this or hate it with the first track, guaranteed. This pretty much defines the genre “wife-annoying music” which is why it lives at work. I love it. A friend of mine used that intro as one of his “world’s most annoying ringtones”.

Keith Jarrett, The Köln Concert (1975): This was my music for must-get-done studying in college. There’s something about the piano-killer percussive playing and the sweet melodies that keeps my brain ticking over. I just did my CD replacement buy and I still know exactly where all the moans and groans are even though I hadn’t listened to it for twenty-five years. I only have one reservation about this record — it just seems wrong to have nearly memorized an improvised concert.

Cat Stevens, Catch Bull at Four (1972): You’ve probably heard of this one, since it sold a zillion copies. Most people like the hippie stuff on side 1, but I’m fond of the darker side 2 tracks, especially “18th Avenue” and “House of Freezing Steel”.

Looking at the dates, it is clear that I was exposed to a bunch of new music when I got a DJ shift at KTRU in 1978.

Future of Bibliographic Control

I went to the Library of Congress open meeting on bibliographic issues a couple of weeks ago. Interesting, but I think they have a long way to go. This meeting was a good stab at understanding users, both searchers and catalogers, but the tricky part is the model and system interface. How to support links and mashups and massive content generation and cataloging? There was some talk about tagging, but the anti-spam algorithms needed for low-trust, low-authority cataloging are far beyond the expertise and budgets of libraries.

The official writeup and lots of notes by Karen Coyle are good places for more thorough coverage.

Bernie Hurley from UC Berkeley gave a talk on today’s issues with MARC (see Karen’s notes). This was far more interesting than I expected, mostly because it was fact-based. Some tidbits:

  • MARC cataloging is expensive, even when outsourced to India
  • thesis cataloging is different; the subject areas tend to be outside of the established categories
  • MARC has more information than they use (they have 175 fields, but 2/3 of search is on just 3, and they show a maximum of 27)
  • it does not have the information that is needed for search and faceted browse (from Andrew Pace, NC State)
  • the book height and depth are measured for shelving, but we need the weight and thickness for mailing them (also from Andrew)

The main fields they use are:

  • Author
  • Title
  • Subject keywords
  • Date for sorting
  • LC Classification

Several speakers, both from the podium and the floor, were pinning their hopes on full-text search. I presume that is because they haven’t tried implementing it. I appreciate the optimism, but full-text is Muggle Technology, not magic. Full-text is great for finding the next 20% or 30% of stuff, but most of your good results come from great metadata (including links and attention data). As Dan Clancy (Google Book and Google Scholar) pointed out, book search is much harder than web search precisely because you don’t have as much link data (metadata). No one had any good ideas about how to get access to all that text so it could be indexed. Well, ideas besides Google Book.

Hey, why wasn’t Brewster Kahle invited? Maybe the LoC already knows what he thinks, but a position paper would be handy for the rest of us.

On-line access to content is working OK. The only complaints were about the URL fields in library catalogs. If you don’t know what MARC is, take a moment to look over MARC 856, Electronic Location and Access. It’s a little more complicated than the <a> tag.
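
To make the comparison with the <a> tag concrete, here is a rough Python sketch that turns a simplified 856 field into an HTML link. It uses three real 856 subfields ($u for the URI, $y for link text, $z for a public note), but the flat dict, the function name, and the example URL are assumptions for illustration. A real record also carries indicators and repeatable subfields, which this ignores.

```python
from html import escape

def link_from_856(subfields):
    """Build an HTML anchor from a dict of MARC 856 subfields (simplified)."""
    url = subfields["u"]  # $u: Uniform Resource Identifier
    # Prefer $y (link text), then $z (public note), then the bare URL.
    text = subfields.get("y") or subfields.get("z") or url
    return '<a href="{}">{}</a>'.format(escape(url, quote=True), escape(text))

# Hypothetical field contents, not taken from any real catalog record.
field = {"u": "http://example.org/report.pdf", "y": "Full text"}
html_link = link_from_856(field)
print(html_link)
```

Even this toy version has to choose between two candidate labels, which hints at why the URL fields draw complaints.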

The day started with an interesting and dangerous talk by Timothy Burke on the wonders and difficulties of serious research using our current tools (see Karen’s notes). It was mostly about searching techniques, though it wasn’t really explained that way. I would have been happier if he’d started with some terminology from Marcia Bates. The personal view was helpful, but this should be well-understood stuff by now.

The danger is aiming our tool efforts primarily at the expert user. That way lies disaster. There is really only one way to do this and succeed, and that is to follow the Rob Pike architectural rules:

  1. Simple things are simple.
  2. Hard things are possible.
  3. You don’t have to understand the whole system to use part of it.

Once you do this, the fancy tools can be built on top of it. If you design for the fancy stuff, the system will never be simple and it will probably be over-fit to an old problem (like MARC is today).

One other point from Burke’s presentation: universities no longer teach how to do a literature search. Each discipline has general techniques and domain-specific ones (think chemical structure search), and this cannot be fobbed off on some other department. Striking out on your own might help avoid the prejudices of the field, but it can also mean missing and reinventing a lot of stuff.

I also saw some premature target lock-on. For example, converting subject headings to strings of standalone “subject keywords” is a lot of work, and is primarily useful for faceted browsing. Faceted browsing is good, but it is only one approach. We may be using facets because they are the best we can do with the HTML-based web apps of the past five years. Is it right for five years from now, when the conversion is done, or did we just blow a wad of cash on another dead technology?

Finally, I should have asked Andrew Pace how much NC State spent on Endeca.

A side note — Google did a poor job of hosting this event. We had to park a half-mile away, there were no power strips for laptops, I couldn’t get back on the GoogleGuest net after 10AM, we had a “mini kitchen” instead of the usual wide array of free munchies (dang!), and lunch was “here’s a map of the area”. No one stood up to say “let me know if there are any problems”. A few people got power by unplugging the massage chair. Worst of all, the committee was ushered off to the Google Cafeteria, so there was no way to talk with any of them over lunch. Why have an open meeting if you aren’t going to eat together? That was golden time with users, and it was squandered.

Templates and Website Design

John Gruber posts about templates and design on his excellent blog, Daring Fireball. He talks about specific blogservers, but the point is true for any serious website.

I started the visual design with a blank sheet of paper, and then moved on to an empty Photoshop file. I designed the markup starting with an empty XHTML 1.0 skeleton in BBEdit. I designed the URLs on pen and paper, trying to maximize clarity and structure while minimizing cruft and length.

On the other hand, anyone who’s designed a software library is well aware that 90% of their customers’ shipping code has some chunk that was written by loading up the sample program in an editor. When I was working on ORBlite, it was pretty easy to tell who had used my sample code and not replaced the default log message (“Oops, an error occurred.”).

So, I disagree with John a bit. He makes the right qualifications for his recommendations (“… for anyone attempting to establish their own unique brand”), but that is a tiny fraction of websites, though a larger fraction of traffic.

There are plenty of websites that should be usable, attractive, and functional (utilitas, venustas, firmitas) without a ground-up design. Said differently, the default templates need to be excellent, with a set of base styles broad enough to serve as useful starting points for various tastes. Even those tastes that design MySpace pages.

As John says at the close of the article, “If you start with nothing, you’re forced to think about everything.” For a designer, that’s great. For the rest of us, not so good. For good or ill, most templates aren’t that far from “nothing”.

To be specific, I’d like one, just one, template for Movable Type 3.x that has a fluid width.