Future of Bibliographic Control

I went to the Library of Congress open meeting on bibliographic issues a couple of weeks ago. Interesting, but I think they have a long way to go. This meeting was a good stab at understanding users, both searchers and catalogers, but the tricky part is the model and system interface. How to support links and mashups and massive content generation and cataloging? There was some talk about tagging, but the anti-spam algorithms needed for low-trust, low-authority cataloging are far beyond the expertise and budgets of libraries.

The official writeup and lots of notes by Karen Coyle are good places for more thorough coverage.

Bernie Hurley from UC Berkeley gave a talk on issues today with MARC (see Karen’s notes). This was far more interesting than I expected, mostly because it was fact-based. Some tidbits:

  • MARC cataloging is expensive, even when outsourced to India
  • thesis cataloging is different, the subject areas tend to be outside of the established categories
  • MARC has more information than they use (have 175 fields but 2/3 of search is on just 3 and they show a maximum of 27)
  • it does not have the information that is needed for search and faceted browse (from Andrew Pace, NC State)
  • the book height and depth are measured for shelving, but we need the weight and thickness for mailing them (also from Andrew)

The main fields they use are:

  • Author
  • Title
  • Subject keywords
  • Date for sorting
  • LC Classification

Several speakers, both from the podium and the floor, were pinning their hopes on full-text search. I presume that is because they haven’t tried implementing it. I appreciate the optimism, but full-text is Muggle Technology, not magic. Full-text is great for finding the next 20% or 30% of stuff, but most of your good results come from great metadata (including links and attention data). As Dan Clancy (Google Book and Google Scholar) pointed out, book search is much harder than web search precisely because you don’t have as much link data (metadata). No one had any good ideas about how to get access to all that text so it could be indexed. Well, ideas besides Google Book.

Hey, why wasn’t Brewster Kahle invited? Maybe the LoC already knows what he thinks, but a position paper would be handy for the rest of us.

On-line access to content is working OK. The only complaints were about the URL fields in library catalogs. If you don’t know what MARC is, take a moment to look over MARC 856, Electronic Location and Access. It’s a little more complicated than the <a> tag.

The day started with an interesting and dangerous talk by Timothy Burke on the wonders and difficulties of serious research using our current tools (see Karen’s notes). It was mostly about searching techniques, though it wasn’t really explained that way. I would have been happier if he’d started with some terminology from Marcia Bates. The personal view was helpful, but this should be well-understood stuff by now.

The danger is aiming our tool efforts primarily at the expert user. That way lies disaster. There is really only one way to do this and succeed, and that is to follow the Rob Pike architectural rules:

  1. Simple things are simple.
  2. Hard things are possible.
  3. You don’t have to understand the whole system to use part of it.

Once you do this, the fancy tools can be built on top of it. If you design for the fancy stuff, the system will never be simple and it will probably be over-fit to an old problem (like MARC is today).

One other point from Burke’s presentation, universities no longer teach how to do literature search. Each discipline has general techniques and domain-specific ones (think chemical structure search), and this cannot be fobbed off on some other department. Striking out on your own might help avoid the prejudices of the field, but it can also mean missing and reinventing a lot of stuff.

I also saw some premature target lock-on. For example, converting subject headings to strings of standalone “subject keywords” is a lot of work, and is primarily useful for faceted browsing. Faceted browsing is good, but it is only one approach. We may be using facets because they are the best we can do with the HTML-based web apps of the past five years. Is it right for five years from now, when the conversion is done or did we just blow a wad of cash on another dead technology?

Finally, I should have asked Andrew Pace how much NC State spent on Endeca.

A side note — Google did a poor job of hosting this event. We had to park a half-mile away, there were no power strips for laptops, I couldn’t get back on the GoogleGuest net after 10AM, we had a “mini kitchen” instead of the usual wide array of free munchies (dang!), and lunch was “here’s a map of the area”. No one stood up to say “let me know if there are any problems”. A few people got power by unplugging the massage chair. Worst of all, the committee was ushered off to the Google Cafeteria, so there was no way to talk with any of them over lunch. Why have an open meeting if you aren’t going to eat together? That was golden time with users, and it was squandered.