MARC, linked data, and human-computer asymmetry

I had to put together the introductory lecture for my “XML and Linked Data” course early this time around, because I’ll be out of town for the first class meeting owing to a service obligation. Since I’m starting with linked data instead of XML this time, I found myself having to think harder about the question nearly every student carries into nearly every first-class meeting: “why should I be here?” Why, among all the umpty-billion things a library school could be teaching, teach linked data? Why does it matter?

Now, I think “because it’s part of librarianship!” is a cop-out, and I try to avoid relying on it when I explain to students why we teach what we do. (It isn’t always avoidable; professional socialization is real, and important, and genuinely the only reason a few things remain in the library-school curriculum.) That means that a good answer to the “why should I be here?” question with respect to linked data explains why not only libraries, but not-inconsiderable swathes of the larger information world, find linked data a useful construct.

This question has extant answers, quite a few of them. I started to make slides summarizing them—the “MARC must die!” answer and the “Be of the web, not just on it!” answer, and the “Lose the silos!” answer—and then I said something decidedly family-unfriendly and deleted the slides. I don’t like those answers. They won’t convince a skeptical student. They won’t even convince an open-minded one. They’re too exhortatory, too condescending, too inside-baseball. I needed a simpler, more fundamental explanation, one that took into account not only the purported benefits of linked data, but the very real issues with existing data structures and systems.

To arrive at that explanation, I found myself thinking about the exact design problem on Henriette Avram’s plate when she invented MARC. That design problem was “let computers print catalog cards.” Avram was not asked to design a computer-optimized data structure for information about library materials, so naturally enough, that is not what MARC is at heart. Avram was asked solely to make computers print a record intended purely for human consumption according to best card-construction practices of the 1960s. She did that, brilliantly.

Unfortunately, that turned out to be the wrong thing to ask her to do. At the time, of course, Avram could not realistically have known that. At the time, libraries themselves could not have known that! In the decades since, however, just about everyone has discovered and rediscovered that designing data based solely on how it should look for human beings, without considering how computers may need to manipulate it, leads inexorably to ruinously messy, inconsistent data and tremendous retooling costs—exactly the challenges libraries face now.

Book and journal publishers discovered this when their carefully typeset page-layout files turned out all but useless to them for many forms of electronic distribution. Back in the dot-com boom, I watched several publishers struggle, and it seemed to me the hardest problem they faced was making themselves step back from the human-readable look of print so that they could see texts in other ways. Learning-object developers have learned the same harsh lesson more recently, as the Flash technology many learning objects were built with obsolesces. I hate to think how much effort (not to mention money) in the mid-to-late-2000s was poured into Flash objects that cannot be used on smartphones and tablets, and whose component parts (out of which something more modern might be built) were thrown away because no one saw a need to keep them. Finally, not a few of the problems the “big data” movement is trying to solve arise from needing to force computers to make sense of material designed only for humans.

The base human-computer asymmetry fueling these problems is not hard to articulate, and is in fact what ended up on my class slide as the “why” of linked data. Given computer-friendly data, humans can instruct computers to produce human-friendly data displays, in addition to doing all the fascinating behind-the-scenes manipulation that fuels useful applications from search indexing to text mining. Given displays friendly only to humans, however, computers cannot easily break the displayed materials down into computer-manipulable data. Sometimes it can be done, but only at great cost in time and effort; sometimes it is outright impossible. Even when retooling human displays as data is possible, the additional effort is weakly justifiable at best, given the relative ease of going in the other direction.
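
A toy sketch of that asymmetry, mine rather than anything drawn from real cataloging tooling, with record fields and a citation layout invented for the purpose: turning a structured record into a human-readable display is a single formatting expression, while recovering the structure from the display means pattern-matching that guesses at punctuation and breaks the moment the layout changes.

    import re

    # Computer-friendly record (field names invented for this sketch).
    record = {"creator": "Avram, Henriette", "title": "The MARC Pilot Project", "year": "1968"}

    # Easy direction: data -> human-friendly display.
    display = f"{record['creator']}. {record['title']}. {record['year']}."
    print(display)  # Avram, Henriette. The MARC Pilot Project. 1968.

    # Hard direction: display -> data. A regular expression can only guess at
    # the structure, and it fails as soon as punctuation or field order changes.
    match = re.match(r"(?P<creator>[^.]+)\. (?P<title>[^.]+)\. (?P<year>\d{4})\.$", display)
    print(match.groupdict() if match else "could not recover the fields")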

Worse still, freezing one form of human-readable display into a data structure assumes that what is ideally human-readable now will remain so indefinitely. The brief history of design for the web, not to mention the emergence of entire information specialties in usability, user-experience design, and (more broadly) human-computer interaction, shouts otherwise. So does the story of MARC, for that matter; who would go back to 1960s-formatted catalog cards today? Thinking about the data separately from its display defends against stale design by keeping display-design options open.

Now, I’ve pulled a fast one on you all—did you notice? I’ve been treating the idea of “designing data for computers” as though it were all one thing. It isn’t. If it were, I wouldn’t have to teach one entire library-school course on relational-database design, another on XML and linked data, and a third on digital curation. Reasonable (and not-so-reasonable) people disagree on the best way to design data structures that accommodate computer manipulation while providing the best available grist for human-friendly design. I’m not sure it’s always possible to judge in advance, even; sometimes there isn’t a feasible shortcut past the hundred ways it won’t work. Context matters too; one of the awkward things about teaching XML and linked data together is that XML was really designed for a document context, not so much for the data often captured in it.
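
To make the XML point concrete, here is a small illustration of my own, with made-up snippets, using Python’s standard library: document-context XML full of mixed content is easy to display but awkward to interrogate as data, while element-per-value markup reads back into data trivially.

    import xml.etree.ElementTree as ET

    # Document-context XML: mixed content written for a human reader.
    # Easy to display, awkward to query ("which characters are the date?").
    doc = ET.fromstring("<p>First published in <em>1968</em> in Washington, DC.</p>")
    print("".join(doc.itertext()))

    # Data-context XML: one atomic value per element, trivial to read back as data.
    item = ET.fromstring("<item><title>Pilot report</title><year>1968</year></item>")
    print(item.findtext("title"), item.findtext("year"))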

I get a sense from many frustrated bystanders watching the move away from MARC that they want to be told the one true answer, the ultimate data structure that solves every metadata problem libraries have or ever will have. I’m afraid this quest is doomed to failure; nothing is so elegant that it can never be improved upon, nothing so universal that the world will not change out from under it. In other words, we shouldn’t feel bad about needing to migrate from MARC. It’s all right that FRBR didn’t turn out to be the be-all and end-all, that RDA had to go through a few rounds of redesign, that several European libraries have decidedly different linked-data models for their bibliographic data. That’s normal theoretical and experience-based churn.

That said, I think there are a few readily grasped principles that distinguish data structures that computers work well with from data structures that only work for humans. I don’t know that I have the perfect list of those principles, but here are the ones I start with in all my data-structures classes (a small sketch after the list shows all three at work):

  • Atomicity, also known as granularity. Essentially this means saying only one thing at a time, and clearly (and only once) separating each thing from every other thing. Computers can build up from granular pieces of data, but they’re surprisingly bad at breaking compound, complex, or ambiguous statements into their component parts.
  • Consistency. This means saying the same thing the same way every single time it’s said. Computers absolutely thrive on consistency; unfortunately, human brains often don’t need it, so displays designed solely for humans don’t tend to feature it. MARC data in particular is notorious for inconsistency.
  • Reliable, unchanging identifiers. You think you’re bad with names? Computers are worse. All the various numbers and labels that sometimes seem to define our lives—credit-card numbers, account numbers, usernames—exist because computers cannot reliably pull together information based on human-assigned names. The resulting indirection in computer-friendly data structures is legitimately annoying, especially at first, but it’s necessary.
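
Here is the sketch promised above, a handful of invented records (the titles, field names, and identifier scheme are mine, not MARC’s or any real vocabulary’s). Each field says one thing and one thing only (atomicity); in the second version the preferred form of the name is recorded exactly once and referenced the same way every time (consistency); and the key “person:0001” stands in for the human-assigned name (identifiers), which is what lets the computer pull one person’s works together.

    # Records that are atomic but rely on name strings: the same person appears
    # under three different names, so grouping by name silently yields three
    # "creators".
    display_only = [
        {"title": "Record A", "creator": "Avram, Henriette"},
        {"title": "Record B", "creator": "Henriette D. Avram"},
        {"title": "Record C", "creator": "H. Avram"},
    ]
    by_name = {}
    for rec in display_only:
        by_name.setdefault(rec["creator"], []).append(rec["title"])
    print(len(by_name))  # 3 -- the computer cannot tell these are one person

    # The same records with a reliable identifier (invented here): the name is
    # stated once, consistently, and collocation becomes trivial.
    creators = {"person:0001": "Avram, Henriette"}
    identified = [
        {"title": "Record A", "creator": "person:0001"},
        {"title": "Record B", "creator": "person:0001"},
        {"title": "Record C", "creator": "person:0001"},
    ]
    by_id = {}
    for rec in identified:
        by_id.setdefault(rec["creator"], []).append(rec["title"])
    print({creators[k]: titles for k, titles in by_id.items()})
    # {'Avram, Henriette': ['Record A', 'Record B', 'Record C']}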

Following those principles won’t guarantee a perfect data structure because there’s no such thing, but these principles do lead to flexible data structures with escape hatches. Just as book and journal publishers who bit the bullet and bet on SGML had a much easier time when the web and ebooks came along than did publishers who relied solely on page-layout files, libraries that work toward more atomic, consistent, and identifier-laden data will be safer not only in today’s information world, but in whatever succeeds it.

Note: This post is copyright 2015 by Library Journal. Reposted under the terms of my author’s agreement, which permits me to “reuse the work in whole or in part in your own professional activities and subsequent writings.”