Grab a bucket! It’s raining data!

Hi there. Thanks very much to Mark Leggott for inviting me here, and to all of you for lending me your ears for a time. You’ll have noticed that the title of this talk in the program notes is very formal and buttoned-down. ‘Representing and managing the data deluge.’ Well, I am not a formal and buttoned-down person, but when Mark approached me to speak here, I was actually scared to death to accept, and so I wrote this really terribly boring title. I’ve just up and changed it: the real title is ‘Grab a bucket! It’s raining data!’

To hear some folks tell it, it’s a golden age to be a digital librarian. Here we have an entire new form of scholarly publication—digital research data—and it’s ours for the asking! In times when we’re all worried about the future of libraries (and let’s face it, librarians, we are), this feels heaven-sent. Grab a bucket, it’s raining data, hallelujah!

It so happens that in some quarters, I am now styled the “Cassandra of Open Access.” Cassandra, for those not up on their Greek myth, was a Trojan prophetess who was cursed such that nobody believed what she said until it was too late. Being from Troy, which was of course completely doomed, most of her prophecies were fairly dire, too. “Hey, the Greeks are about to wheel a big wooden horse into your city so they can burn it down and kill everybody!” Not happy-making stuff we’re talking about.

Some people have mistaken my Cassandra-nature for an animus against open access generally and institutional repositories in particular. I’ve never had it in for open access! Who doesn’t like open access? It’s similar to what Cory [Doctorow] said yesterday: it’s hard to be against an unambiguous good like open access without sounding like a total jerk… which hasn’t stopped some publishers, of course. But I’ve been running institutional repositories for close to five years now, and the on-the-ground reality has been quite a bit… blurrier.

  • Conflicting, contradictory, and in some cases flatly impossible goals.
  • Minimal means, because of people who seem to have been reading the mythical “Frommer’s Institutional Repositories on $5 A Day.”
  • Asking for time, effort, and data from faculty without giving them any real service or any return on their time investment that made sense to them.
  • Cramming things into IRs that just didn’t fit with the very limited IR view of the digital universe, just because we hadn’t anywhere else to put them: our content didn’t fit in the container we had.
  • Completely ignoring faculty needs and desires.

I’m seeing some of the same thought and design processes happening now with regard to e-science, e-research, cyberinfrastructure, data curation—whatever you want to call it. This troubles me. So I can’t help but wonder if I’m becoming the Cassandra of Data Curation.

Optimistically, though, it’s early days yet. There’s no reason we have to make the same mistakes with data that we made with IRs. So, I don’t want anyone to think that I’m raising the problems I’m going to raise in this talk because I’m somehow against research data curation, or I think libraries shouldn’t get involved with it. I am all for research-data curation, and I believe very strongly that libraries need to get involved. I just think we should know what we’re getting ourselves into, and if that means I’m a little Cassandraic, well, so be it.

I will focus this talk on the fit between content and container, though I may touch on other things. I’ll examine some of the qualities of typical research data, then talk about digital libraries and IRs, looking hard at some of the impedance mismatches we’re liable to run into, and then I will strategize a little bit about how to make ourselves and our systems better now, before we run headlong into another mess. The lens I’m going to be looking through is a human lens, not so much a technological lens. This is not just a technology problem; I can’t say that loudly enough.

What do we know about data?

There’s a lot of it. I’ll just reiterate something we heard yesterday: Even if we admit that the Large Hadron Collider types are probably going to take care of themselves—and this isn’t something I necessarily admit; I know huge, well-funded projects that are making huge messes with their data—even if we admit that, we’re still looking at an incredible flood of stuff.

Have we got big enough buckets? I don’t know. At this juncture I feel it incumbent upon me to say the word “cloud.” Cloud. There. I have said it. I now feel no need at all to say it again. Look, I understand that storage and networking are problems that have to be solved before we can do anything else. Just—to me, it’s necessary but not sufficient, even though it seems to be getting all the attention right now. So I’m going to move on from storage size to characteristics of research data that I’m more interested in.

Data exist to be interacted with.

One thing I think we need to keep in mind about data is that they are not an end in themselves. We don’t keep data just to keep data; we do it because researchers can pick up shovels and dig around in the sands and build knowledge like sand castles! Data are there to do things with. To be examined, cleaned up, verified, refuted, corrected, number-crunched, mashed up with other data, graphed, charted, visualized… and if we treat them as though they were unchangeable museum objects—look but don’t touch, like books chained to a medieval lectern—we are actually getting in the way of making new knowledge. If nobody can do things with data, there is no point in keeping it all! That’s what CC0 is about, as we heard in a Q&A session yesterday: removing legal barriers to messing about with data. We, we librarians, need to remove technical barriers to messing about with data.

What’s more, different kinds of data have different affordances. You don’t use a plastic sand-shovel to dig a rock quarry, just the way you don’t use a backhoe to build a sand castle. The way a sociologist interacts with census data is just wildly different from the way a medical researcher interacts with MRI data. The data buckets we build will have to internalize and respect those affordances, or at the very least allow researchers to build tools on top that respect those affordances.

Data are wildly diverse in nature

… as are data’s technical environments. In other words, data are diverse, so the buckets we put them in will need to be different shapes and colors in order to respect that diversity.

Now, differences in data can sometimes be skin-deep. The difference between a digital image of a sculpture and a digital image of a physics field station in Antarctica is in some ways not much for our purposes, however different our researchers may think they are. But sometimes the differences really do matter. You can’t treat a book in TEI markup the same as a book of page-scanned images; you will be doing violence to readers of one or the other. A microscopy researcher on my campus does cell sections digitally; you can train a microscope to focus from the top of the cell all the way through and down, and then you can create a 3D cell image to play with. It’s really cool! But a system such as DSpace that treats each section image as a wholly separate and unrelated thing is making it impossible to get any knowledge out of those data.

Think for a moment about a single bucket that works for the TEI book, the book of page scans, the images of the Antarctic field station, and the microscopy data, and you’re starting to realize the scope of the data-diversity problem.

Again, we don’t control the technical environments our researchers are using to generate data. Some of those environments are proprietary, and Mike Rylander talked yesterday about why that’s a dangerous, dangerous problem. But even leaving that aside, if we’re really, really lucky, we might have a chance to make recommendations to researchers about their data. For the most part, though, we are the ones who will have to adapt to whatever they’re doing.

Data are already out there.

Why is that? It’s because we’re not creating all the digital research data out there; the researchers are. And they’ve created it in huge volumes already. So I’m really interested when Dan Chudnov says that the Library of Congress is working to capture data at world-scale and web-scale, because I want them to teach me how to do that.

So, researchers. They’re not thinking long-term about the data they’ve created. They’re not thinking past the expiration of their next grant! That means we have to. We’re the only people with a long-term time horizon. Furthermore, they’re not likely to come to us; for the most part they can’t even imagine that we can help. The inescapable corollary here is that we can’t just sit back and wait for data to come to us—a lot of it we’re going to have to go out there and rescue!

And I may be airing some library dirty laundry here, in which case please forgive me, but it’s not just them—it’s us. We have plenty of unsustainable digital projects sitting around our libraries. Just think for a second: how many different digital-library, repository, and storage platforms are running inside your library? I won’t even answer for mine; it’s a scary large number. The stuff in those platforms is in danger. We made this mess, we librarians; we have to clean it up. As Richard said yesterday, we have to set an example with our own data! How are we going to establish ourselves as authorities in describing and organizing data if our own datastores are not in order?

A lot of data are analog.

… but really want to be digital. For example, scientists still use paper lab notebooks. I wish they didn’t, too! The university archivist on my campus really wishes they didn’t, because they keep trying to give him hundreds of boxes of lab notebooks that he can’t possibly find space to store! And that’s just one example. Linguistic field notes, on paper. For one of the linguists I’ve talked to, her notes are some of the only attestations of the language we have! Slides are a constant bugaboo in visual arts communities. Faculty have a tremendous volume of analog materials that would be of much, much greater use if they were digital. Can we scale up to that? Again, I don’t know, and I’m not going to talk about this problem again. It’s there, we probably need to solve it, end of story.

Data are project-based.

Aha. Now we get interesting.

Exploring the Hype(r) is a dissertation. It includes its underlying data, as it says: “Explore the primary data from this research project and construct your own hypermedia ethnography.” As you may be able to see at the bottom of the page, it’s built on the blogging tool WordPress and the Center for History and New Media’s exhibit-builder tool called Omeka. These are great tools! I love them both. But what are we librarians going to do as our dissertators pile random webtools on top of each other to build their dissertations? That’s what project-based thinking gets you: total technological randomness. But our researchers think in terms of projects. The latest grant. The latest collaboration. And when it comes to technology, they’re not above doing something different and sui generis for every single one.

Data are sloppy.

By the same token, faculty are not librarians. They are messy, messy people, a lot of them. Many more of them leave petty chores (like, I don’t know, organizing research materials and results) to their grad students. This means that our data buckets are not going to fill up with nice, neat, orderly, well-described, data-dictionaried columns of numbers. Honestly, what we’re doing is catching sloppy leaks, and we can expect to be for a long, long time.

And when our systems, library systems, only accept data that’s clean and pretty, we have a problem. Data standardization is not even seen as a desideratum by data creators yet. I know most of us know this, but in the print world, the journal-article-and-book world, we have publishers to impose some kind of uniformity. Data doesn’t live in that kind of world. We may yet get there, but honestly? I don’t expect it in the length of my career.

That’s our trawl through some basic characteristics of research data. What do we have in libraries to throw at this problem?

Our Big Buckets

The digital library

People think that primary-source data, even big data, is a new thing to libraries. It’s not. We were doing big digital data before the researchers were, in the form of the digital library! What did I hear yesterday, ten terabytes of TIFFs from a single digitization project? So it’s possible to think hey, we’ve got this solved! We just apply our existing digital-library infrastructure, human and technological, to this new problem.

The institutional repository

If that’s not enough, at the same time, we’ve been building another kind of digital bucket; we’ve called it the institutional repository. And again, some people think that IRs just solve the data problem. Magic IR pixie dust, or something!

Impedance mismatches

Well, it won’t surprise anyone that I don’t think that’s true. There is no magic pixie dust for research data curation, not in digital libraries and not in IRs. What we’ve done with digital libraries and IRs gives us a lot of the skill and knowledge we need to work with research data; I firmly believe that, though it’s hard to find researchers who do. But we’re going to have to do a lot of rethinking and reworking the way we do things. Otherwise, we’ll just trip all over ourselves and the impedance mismatches between the characteristics of research data and the characteristics of digital libraries and IRs. So let’s take this a piece at a time.

What do we know about digital libraries?

Where I am, we’re trying to rebrand our digital collections, because we don’t think “digital” should be what linguists call a ‘marked’ state any more. Digital is ordinary, or it should be. Digital is normal. So given that, how do you brand digital collections? If you have an idea, see me after, will you?

Anyway, what are digital libraries like? And how is that going to work with research data? Just like our print libraries, we’ve built our digital libraries carefully, out of the best materials. We’re not making digital libraries out of any old thing; we select what we’re prepared to lavish effort on. And we do lavish effort! Look at this Library and Archives Canada website! It’s a lexicon of a First Nations language called Naskapi that’s available in that language, including the fonts to represent that language. I love this site! It’s beautiful! [Note: The site no longer exists. It was probably destroyed in the early-2010s gutting of Library and Archives Canada.]

How are our thoughtful, careful collection-development policies going to cope with what’s already out there? How will we decide what we pick up and what we leave behind? I already see troubling signs that in the absence of better policy, cyberinfrastructure shops are deciding to help whoever has money. I don’t like that trend and hope we in libraries challenge it. When I go to data curation workshops, most people think of data curation as “the new special collections” or “the new archives.” Understanding that we can’t keep everything, they’ve come up with elaborate decision mechanisms for figuring out what to keep and what to toss.

Well, I think there’s a problem with that. It’s a human problem. It’s the faculty member who, when told you’re not going to curate his lousy badly-designed badly-described dataset, turns around and never darkens your door again—even when he’s got a dataset that will revolutionize his field. How do we harmonize the need to provide good service with the need not to swamp ourselves with garbage? I don’t know, and I suspect answers will differ, but I do know we need to figure that one out.

How are we going to rescue data when, by our standards, a lot of it is sloppy? Are we prepared for the work involved in rescuing other people’s sloppy data? Are we prepared to let other people’s sloppy data in alongside our nice clean pretty data?

And given that many datasets are project-based, are we going to pick and choose among projects? Based on their software platforms? Can we? What about dissertations, which are institutional records no matter how they’re created? We’re going to have to rethink how much and what kind of care we can and should give our data libraries. Like it or not, they can’t all look as beautiful as this; volume and condition forbid.

Production is a Taylorist’s dream.

Where I’m from, and perhaps where you’re from too, we like our production of digital objects, mostly but not entirely through digitization, to run like a well-oiled machine. It’s generally more cost-effective to do things in large volumes and in systematic ways. In the States, we call this a “Taylorist” way of going about things. For those who don’t read management literature, Frederick Taylor was the efficiency expert whose ideas are usually credited with shaping how Henry Ford ran auto production. Taylor measured how long it took people to do things, and made it so people had to make the fewest and smallest motions possible to get the work done.

What Taylorist production methods mean in a digitization context, of course, is that you tend to limit the type of work that you do to what you can easily automate and train for, which in practice means only a few kinds of data per library. We do our image collections or our newspapers or our finding aids or our text collections—we in essence specialize ourselves by data type, again for efficiency’s sake.

How well is that going to serve us when we’re not in control of the data-creation process? When the data don’t fit into the buckets we’ve designed for our own particular digital-data specialties? If we’re going to come to grips with data on an institutional basis, we won’t have the luxury of specializing any more. How are we going to cope? How can we be Taylorist about gathering and describing data when the data just aren’t standardized? And if we can’t be Taylorist about it, how do we keep up with the flood?

How are we going to manage when there’s a technical-infrastructure mismatch between their project silos and our Taylorist, tailored environments? We have some choices, but none of them are particularly good. Do we pull the data out and start over, ignoring the effort put in on the original interface? If it’s on the web, do we take a static snapshot of the original? That feels a bit to me like pinning a gorgeous butterfly through the head, killing it, to display it in a glass case, though I have to admit that I do it because I don’t necessarily have a better option. Do we recreate the original interface, and take on the work of maintaining and improving it? Those don’t sound like Taylorist processes to me!

I’m frightened—honestly scared to death—at how many librarians do not realize that this is a problem. They really seem to think that you wave a magic wand over somebody’s random dataset and it miraculously shows up in a repository! It does not work that way! For every new input, somebody has to figure out what’s in there, how best to represent what’s in there on the repository technology platform (whatever that is), and how to move the old representation into the new one. That… looks suspiciously like work. No, look, I do it—trust me, it’s work.

Where I work we’re starting to think and talk very seriously about this, because our digital-library processes are very Taylorist, and we’re realizing that that’s not serving us well as smaller and more specialized projects come our way. Everything right down to how we budget projects is going to have to change. Honestly, we’re finding this a struggle—but a necessary one, and one that I am proud to say that we’re confronting head-on.

Some of you are looking at me right now with utter bemusement. Your digital-library production isn’t Taylorist at all! You only wish it were. What it is, is completely ad-hoc. Something interesting comes in, you build a way to deal with it, you slap it up on the Web somehow or other, problem solved.

And thus are born project silos, both inside and outside libraries!

One of the problems with project silos is that they aren’t replicable across libraries and institutions… and the last thing any of us need is to reinvent the wheel! If you’ve never looked at Decameron Web, I love it, check it out; there’s some nice TEI-based user-interface work in there. But I can’t build DanteWeb or CervantesWeb based on Decameron Web; the innards of Decameron Web are opaque to me. It should be easier.

And another problem: project silos aren’t part of the web. It’s what I saw called a “cabinet of curiosities” in an article I was reading: nice to look at, but impossible to really work with. Now, this isn’t entirely the fault of library technology. It’s partly the fault of librarians who natter on about “context” as though it were the be-all and end-all. My belief is that context is fluid, not fixed; it’s constantly being built and rebuilt, rather than something trapped like a fly in amber. We have to expose our digital objects so that they can appear in entirely new contexts. That’s not decontextualization! It’s re-contextualization, and cabinets of curiosities don’t allow it.

Many project silos are also content-specialized, such that presentation is content-specific. For each project silo, its own user interface. Books browse differently from maps, which browse differently from finding aids. Right? I wonder. How can we maintain all this user-interface code?

Now, I’m the last person to tell you to build The One User Interface To Rule Them All. Not possible! As I said earlier, data have affordances, ways they want to be interacted with, and we absolutely need to respect that. However, it’s possible to go too far in the other direction, building interfaces so content-specific that the content winds up in a cage of jargon and non-interoperability. That’s where I think we are in digital libraries, and it’s a problem.

These practices have a lot in common with what our researchers do! Everything is its own project with its own technology stack and its own silo. Well, this isn’t workable. It’s wasteful duplication of technical effort, for one thing; why build—oh, a tagging infrastructure—more than once? It also creates huge headaches for discovery processes and especially for digital preservation. The more interaction you have to preserve, and the more different ways it’s coded, the more lines of code we’re all maintaining, and who needs more lines of code to maintain?

But it’s happening anyway, and if we’re serious about data we’re going to have to deal with the result.

Now I’m going to go a little Cassandra on you—we have already lost a lot of digital projects to the project-silo problem, particularly in the digital humanities. Some of those projects were ours: developed in libraries, but not sustainably. I predict with absolute confidence we will lose more such projects. There is a crying need for academic librarianship to develop a coordinated, collaborative rescue effort for early digital projects, if only to stem the bleeding.

On a happier note, if we do take the trouble to rescue our own projects, we will learn a lot about rescuing other people’s. I think that that learning process all by itself should be incentive for forward-thinking academic libraries and librarians to start undertaking rescue efforts.

So that’s where we are with digital libraries, and where I think our practices are going to come up short in the new data world.

What about institutional repositories?

The word “institutional” is becoming a serious problem. I would argue it always was. In my worklife, if I run into digital objects needing archiving, I cannot go anywhere near them until I prove a link to one or more faculty members at my home institution, and the weaker that link is, the more red tape and bureaucracy I have to go through to get permission to help with the project, no matter how important I think that project may be.

The problem is most acute for already-existing data. For example, think about what happens when a researcher leaves your institution for a different one. Their institutional web presence tends to remain behind. There may be valuable data there. But can the IR get involved, if the researcher doesn’t have a connection to the institution any more? Of course, it also means that data at institutions without IRs just fall between the cracks. Definitely not ideal, not what we want.

This is another aspect of data sloppiness. A lot of datasets don’t clearly belong to one institution, or indeed to any institution at all! Consider something like a disciplinary data or e-print repository. One of those just came up for rescue, the anthropology repository known as Mana’o. Would I, as an IR manager, like to rescue it? Sure! Do I have the technical capacity to do it? Mostly; I could at least take a stab at it. Can I do the rescue? Oh, goodness no. Not in my remit; I’m not allowed. It’s not institutional data, so I can’t touch it.

So it follows, at least to me, that if we’re going to grapple with data in our institutions, we will have to give up on the purely inward-looking focus that IRs have had. Maybe different institutions will choose disciplinary specialties to focus on. Maybe we’ll just drop the idea that data have to originate within our institution before the institution is interested in them. I don’t know. But if IRs are going to play in the data space, something in the policy environment has to give.

This restriction, this institutional cage, is an artifact of the scholarly publishers; it’s not something libraries invented. Some publishers allow self-archiving only in “institutional” web presences. If an IR opens itself to a lot of stuff that doesn’t have strong and obvious ties to the institution, it is opening its institution to a very real legal risk, a risk that some publishers will sue the institution, making the argument that it’s not an “institutional” repository any more because it contains non-institutional content. But the reality is that research does not stop at institutional borders. And the more that IRs cling to that institutional cage, the less we can actually do to salvage and protect research data.

Data diversity

Unlike digital libraries, at least in theory, IRs were supposed to accept any kind of digital content or data at all! But the snag there is that they’re not really designed for it; they’re optimized for research papers. So in practice, you get the famous Henry Ford statement about Model T cars: you can have any color you want, as long as it’s black!

The “we’ll take anything” promise is broken and has always been broken. We’d take anything immutable. I use this photo of a junkyard advisedly, because for a lot of faculty, once something they produce is static and final and immutable, it’s junk! It’s out of their sight and they don’t care about it any more. So it never gets deposited in the IR to begin with, which means nobody’s taking care of it. The researcher sure isn’t; it’s old news.

The “static and final” model is absolute garbage for interactive data. It’s especially garbage if interacting with the data is one of the ways that the data are made more reliable! Maybe the first reduction of the data is wrong. If we then can’t change it because our repository only handles what’s final and static… we are not serving the need here. It’s also not ideal for what’s already out there. We know a lot of that stuff is in bad shape, but if we wait to ingest it until we can clean it up into an acceptable final form, we may lose it altogether.

IRs are lousy at dealing with data diversity. I’ll have a few more words about this later, but for now I’ll just state the obvious: putting research data into a user-interface optimized for research papers is a total loser. Papers have built up a lot of uniformity over the centuries we’ve had journals. Data are a whole different story.

Again, all this is a profoundly human problem, and another place where the technology we created has an impedance mismatch with the way researchers actually work and think. Richard’s Q&A session yesterday brought up a key problem with the static-and-final idea: sometimes you think something is static and final when it’s really, really not. And some things are just not even meant to be static and final! DSpace, for example, assumes the static-and-final, so much so that it makes correction of an item already ingested into DSpace difficult and perhaps impossible unless you’re the systems administrator. How much time I have wasted swapping out files for people, you really don’t want to know. Fedora users, don’t get smug here, because Fedora has similar problems.

We can’t accept that for data. Humans are imperfect. The artifacts that we produce are imperfect and incomplete. Our systems need to accept and work with that imperfection, allowing us to work toward perfection, conscious that we’ll never quite get there. Librarians tend to hate this point of view; we’re all about the static and final and authoritative. I am here to say we have to get over our bad selves. We have to, if we’re going to do justice to research data.


So, IRs promise to take anything you’ve got, anything at all—but you have to put it in one file at a time, like coins into a glass piggy bank. Putting data into repositories one file at a time, manually, is like emptying the ocean into a bucket with an eyedropper! And since data are already out there, we have to make it easy to dump in large quantity into our buckets. That means more APIs and protocols. SWORD is good, I like SWORD, I love what I heard about BagIt yesterday—but honestly, it’s got to be even easier than that. I want researchers to be able to push the “Archive It!” button and have it just silently, seamlessly work.
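To make the “Archive It!” idea a little more concrete, here is a minimal sketch of packaging a whole directory in one step, loosely following the BagIt layout mentioned above: a data/ payload directory, a bagit.txt declaration, and a checksum manifest. The function name make_bag and the directory names are hypothetical. It uses only the Python standard library rather than any real BagIt or SWORD client, and a production tool would need a good deal more (tag manifests, validation, the actual transfer).

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir: str, bag_dir: str) -> Path:
    """Copy every file under source_dir into a BagIt-style bag at bag_dir.

    Hypothetical helper for illustration; not part of any repository platform.
    """
    source = Path(source_dir)
    bag = Path(bag_dir)
    payload = bag / "data"

    # Copy the researcher's files as-is, preserving their folder layout.
    shutil.copytree(source, payload)

    # Bag declaration file expected by the BagIt layout.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # One manifest line per payload file: "<sha256>  <path relative to bag>".
    lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag).as_posix()}")
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )
    return bag
```

The point of the sketch is the shape of the interaction: one call takes in a messy folder wholesale, with no per-file form to fill out.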

IRs promise that you can customize their look and feel, but in practice, it’s too hard. How many people in here can tell a DSpace from an EPrints install just by looking at the front page of the site? I sure can. And anyway, what you get even when you customize is this very sterile, boring, libraryish look and behavior; it’s not appealing to the researchers whose hearts and minds we need to capture. Look, I did this redesign for MINDS@UW, I am hoist with my own petard here, but we need to do better than this!

Look at this gorgeous little site for Exploring the Hype(r)! Isn’t it appealing? If I promise the researcher here that I’ll take care of her data forever and ever at the cost of it losing all its visual appeal and its individualized usability, is she going to take me up on that? I wouldn’t take me up on that! So this becomes a content-recruitment problem; researchers see IRs’ ugly, pathetic little one-horse interfaces and interaction patterns and they run screaming in the opposite direction.

I know I keep coming back to this data-diversity issue like a broken record, but so much of our infrastructure just fails when confronted with it. One interface does not fit all. There is some experimentation happening in IR space. Manakin for DSpace making collection-based theming possible was definitely a step forward, though perhaps not enough of one; too much of the page-construction logic still lives in Java. The KULTUR project in the UK is adapting EPrints to be appealing to visual and performing artists. All of this is good and we need more of it, but I think we have to confront a wider issue: building our platforms with enough flexibility to be easy to customize for as much variation as we can manage. We also need to make it easy for people to construct their own look and feel on top of our stuff, or just with our stuff in it where that makes sense. Our silos really get in the way of that now, and it’s a problem.

Metadata and content models

IRs will let you have any metadata you want, as long as it’s no more complex than key-value pairs. I hate this. All the marvelous work being done with linked data, XML, semantic webby sorts of things, and all I can have in my IR is key-value pairs? What is up with that? The diversity of data environments includes diversity in metadata; I’m sure that’s a surprise to no one. It also means a diversity of metadata content models, well beyond key-value pairs. Imagine the ideal data project. It’s already well-described in an elaborate schema and well-organized. Are we seriously going to tell the provider that they have to dumb it down to key-value pairs before we can take it? Seriously? I hope not. Reality check: anybody developing a metadata standard these days expresses it in XML or RDF or both. Key-value pairs don’t cut it, and arguably never did.
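Here is a small sketch of what “dumbing it down to key-value pairs” can cost. The record, its field names, and the flatten_to_pairs helper are all made up for illustration; no particular repository platform flattens metadata exactly this way, but the loss is of this general kind.

```python
from typing import Any


def flatten_to_pairs(record: dict[str, Any]) -> list[tuple[str, str]]:
    """Reduce a nested record to flat (key, value) pairs, the way a
    key-value-only metadata store would force us to.

    Only leaf field names survive; which parent a value belonged to is lost.
    """
    pairs: list[tuple[str, str]] = []

    def walk(node: Any, key: str) -> None:
        if isinstance(node, dict):
            for child_key, child in node.items():
                walk(child, child_key)
        elif isinstance(node, list):
            for child in node:
                walk(child, key)
        else:
            pairs.append((key, str(node)))

    walk(record, "")
    return pairs


# Hypothetical dataset description with two creators and their affiliations.
record = {
    "title": "Field recordings",
    "creator": [
        {"name": "A. Linguist", "affiliation": "University One"},
        {"name": "B. Archivist", "affiliation": "University Two"},
    ],
}

pairs = flatten_to_pairs(record)
```

In the flattened list, nothing but the accident of ordering ties each affiliation to its person, and the fact that these were creators at all has disappeared.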

As for end-user functionality, IRs can take in digital files and they can give them back. Honestly, that’s pretty much all they can do. This just kills us with interactive data. It kills us! A lot of these data need APIs. If we’re not providing them, honestly, we might as well not bother. Interact with data? In an institutional repository? Mash it up with something else? Heavens forfend—that would imply that digital objects are somehow related to each other, and that’s just crazy talk. So we have a lot of interface and API work to do. A lot of it!

Here’s a real-world example of the difficulty. A project I’m helping with for the UW-Madison Zoology Museum involves a teaching collection of animal skeletons that students measure and do comparisons on. We’re photographing those and whomping up an interface that lets students do that measurement work digitally. This saves wear and tear on fragile realia, allows distance students to participate fully, and, we hope, creates an archive that’s useful outside our campus borders.

We’re using Fedora for this, and the content modeling gets complicated. We have a specimen—say, a squirrel—which has any number of actual bones, and each bone may have several photos in various views, and this matters as far as “where do we hang which metadata” and “what do you want people to find in a search?” and “how do you display on a specimen page all of its component bones and views?” These questions have sparked a lot of entertaining (and sometimes macabre) conversations.
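
The shape of that modeling question can be sketched in a few lines of Python. This is an assumption-laden illustration, not the actual Wisconsin content model and not Fedora's API: the class names, metadata keys, and file names are all hypothetical. The point is only that metadata can hang at any of the three levels, and that a specimen page is a walk over the whole tree.

```python
from dataclasses import dataclass, field


@dataclass
class View:
    label: str                      # e.g. "dorsal"
    image_file: str                 # hypothetical file name
    metadata: dict = field(default_factory=dict)


@dataclass
class Bone:
    name: str
    views: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


@dataclass
class Specimen:
    common_name: str
    bones: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

    def page_listing(self):
        """Return (bone name, view label) pairs for a specimen page."""
        return [(bone.name, view.label)
                for bone in self.bones
                for view in bone.views]


squirrel = Specimen(
    common_name="squirrel",
    metadata={"collection": "teaching"},
    bones=[
        Bone("skull", views=[View("dorsal", "skull_dorsal.jpg"),
                             View("lateral", "skull_lateral.jpg")]),
        Bone("femur", metadata={"side": "left"},
             views=[View("anterior", "femur_anterior.jpg")]),
    ],
)

listing = squirrel.page_listing()
print(listing)
```

Deciding whether "side" belongs on the bone or on each photo, and which level a search should return, is exactly the conversation described above.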

Now, imagine for a moment that this had been a DSpace project. Here is the content model for DSpace, the only content model: communities, collections, items, bundles, and bitstreams. “Community” is not even relevant here; “collection” sort of fits, but not terribly well. So we’re left with items, bundles (whatever they are), and bitstreams. And only items can carry metadata! I don’t need to say any more. DSpace, which is running the lion’s share of institutional repositories in the United States, is completely functionally inadequate as a serious data bucket! So much for the IR.

So where does all that leave us?

We need bigger, better buckets. I love the idea of just grabbing a bucket and going after data. I admit it’s probably an 80/20 thing; there’s 20% of the problem-space we’re looking at that we cannot realistically solve. But I know we haven’t served 80% of our users or 80% of our potential content. We can do better, and we need to.

Silos are both necessary and unacceptable. At some level data are all bits, and at that level, silos tend to be counterproductive and stupid. We shouldn’t have to build a checksum engine for sixteen different silos! Where I am, we’re working toward combining our digital library and our institutional repository on a single technical infrastructure, because it just makes sense to do that.
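
At the pure-bits level, one routine really can serve every silo. The following is a minimal sketch of a fixity check using only Python's standard library; the function names and the choice of SHA-256 are assumptions made for illustration, not a description of any existing repository's checksum engine.

```python
import hashlib


def checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected, algorithm="sha256"):
    """Return True when the file still matches its recorded checksum."""
    return checksum(path, algorithm) == expected
```

Nothing here cares whether the file is a thesis PDF, a bone photograph, or a sensor dump, which is the argument for sharing this layer across collections.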

But because data come in thirty-six flavors and then some, once you get above the pure-bits level it’s unrealistic to think that we can design one silo that will work equally well for everything. Our infrastructure has to be flexible, it has to have APIs that other people can build on as well as ourselves, and it should make the most of the commonalities we do find in wildly diverse and heterogeneous data. Homogeneity whenever possible, flexibility where necessary: that needs to be our motto as we build these systems.

We have a lot of modeling to do. Again, because of data diversity, the content-modeling exercise I talked about with the zoology skeletons will have to be replicated, over and over and over again, as new kinds of data come our way. I don’t know if this scares you—it sure scares me. Add standardization processes on top of this, because we can expect some kinds of research data to develop standards, and it gets even scarier. Fundamentally, we need more efficient ways to do this work—a sort of meta-model for content modeling, if you will. I don’t know how that can work; I just know it has to.

We have a lot of code to write. This should be uncontroversial. I know a lot of you are already writing this code! Thank you. Now share it with the rest of us, please, because here is another Cassandraic dire warning: we cannot possibly hope to keep up with the data flood if we’re all making our own little content models and coding up discovery and dissemination frameworks in isolation. Why should anyone out there have to decide how to represent skeletons and bones? At Wisconsin we’ve done that for you!

“We love open source; no, you can’t have our code” won’t work any longer, folks. We have no choice but to figure out how to share code better. What’s more, we have to figure out how to share code with people no more technically inclined than I am, and perhaps less. Now, just a little bit about me: I hate Java. I am violently allergic to Tomcat. I don’t even like man pages! Can you build a system for me? Now think about the vast DSpace install base out there. Think about how many of those installs happened because DSpace was supposed to be an out-of-the-box solution. Now think about how we’re going to migrate these people to something more flexible. Scared yet? I am!

Brian Owen talked yesterday about how hard it is to solve these collaboration problems. I agree with him! It’s hard. It’s a human problem, and human problems are hard. The problem is, all the alternatives to solving this problem are even harder. We’ve got to fix library-technology collaboration.

Of the digital-library and institutional-repository platforms out there today, I think Fedora is the horse to bet on. It’s the only one that comes close to the storage and presentation flexibility needed for a big data bucket, and I think the data buckets such as RepoMMan that have already been built atop it are all by themselves a pretty good indicator that it is the future. But Fedora needs to make some changes—some technical, some social. Content models, service definitions, and their associated code need to be pluggable, to avoid the wheel-reinvention I’ve said we can’t afford. I don’t entirely know how this needs to work, though the plugin and mod structures for projects like Drupal and WordPress may be models. I do know that it does need to work, or we’re all going to drown in our buckets. And then we have to build the social scaffolding to actually share these pieces of code, which may turn out to be harder than the actual technology!
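
One possible reading of "pluggable" is a registry that code for a content model can join without touching the core. The Python sketch below is purely hypothetical: it is not how Fedora, Drupal, or WordPress implement plugins, and every name in it is invented for illustration.

```python
# Registry mapping a content-model name to the class that handles it.
CONTENT_MODELS = {}


def content_model(name):
    """Class decorator that registers a handler under a content-model name."""
    def register(cls):
        CONTENT_MODELS[name] = cls
        return cls
    return register


@content_model("skeleton")
class SkeletonModel:
    """Hypothetical shared plugin for specimen-and-bones objects."""

    def describe(self, obj):
        return f"{obj['specimen']}: {len(obj['bones'])} bones"


def describe(obj):
    """Dispatch to whichever plugin claims the object's content model."""
    model_cls = CONTENT_MODELS.get(obj["model"])
    if model_cls is None:
        raise LookupError(
            f"no plugin registered for content model {obj['model']!r}")
    return model_cls().describe(obj)


summary = describe({"model": "skeleton",
                    "specimen": "squirrel",
                    "bones": ["skull", "femur"]})
print(summary)
```

The core function never mentions skeletons. A site that receives the skeleton plugin from someone else only has to load it, which is the kind of reuse the paragraph above asks for.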

Fedora also made the same mistake DSpace did with regard to the editability and replaceability of objects. Getting stuff into Fedora you can do with what Fedora hands you. Removing stuff, you can do. Editing stuff? No, unless you want to edit XML as text in an incredibly clunky and ugly Java app. Replacing an object with a better object? No. This is not acceptable, Fedora; it needs to be fixed as soon as possible.

We also have to put easier tools on top of Fedora, both on the data-producing and data-consuming ends. That’s being worked on: Islandora, Omeka-over-Fedora, lots of things, and that is all to the good. Fundamentally, we have to figure out ingest straight from whatever unholy mess a researcher has, and we have to be able to translate the affordances of a particular dataset easily into our systems. I don’t think either of those is solved yet, though RepoMMan comes close; even the SWORD protocol is much too complicated for this. But solving this is not optional, because you can’t curate what you don’t have. This is a fundamental truth that the IR experience should have taught us: if our systems don’t invite deposits (even sloppy ones, even unfinished ones, even bad ones by any measurement), and if they don’t do it as early as possible in the research process, so that researchers don’t get fixated on some other software system, there’s no point to having research-data repositories at all. I know this goes against the grain, hundreds of years of library perfectionism, but I’m afraid that’s just too bad! If we’re playing in this space, we have to be ready to make some mud pies.

I’ve always, always loved the RepoMMan project for this reason, and I also really like what the California Digital Library is building. They’re starting with the good old filesystem, which we all know and more or less love, and they’re enhancing it into a curation system. It’s an approach I think will bear fruit, and that’s because they’re starting from the right place: where people actually do their work.

Now, to end on a positive note, I love the Solr app, and I think it’s a marvelous example of the kind of lightweight tool that does really heavyweight things. The beauty of Solr is that once I’ve solved the intellectual problem of “what metadata do I want to expose for search and browse?” Solr makes expressing that in a crosswalk just stunningly, beautifully trivial—and then I never have to worry about it again for that flavor of metadata. There is complexity under the hood, but our experience at Wisconsin has so far been that you don’t encounter that complexity until you actually need it, which is just perfect.
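
For readers who have not written one, a crosswalk is at heart a lookup table from source metadata elements to index fields. The Python sketch below shows only that mapping step, using the standard library. It is not Solr's own configuration mechanism and it does not talk to a Solr server; the index field names and the sample record are assumptions invented for illustration.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical crosswalk: source element -> index field name.
CROSSWALK = {
    DC + "title": "title_t",
    DC + "creator": "creator_facet",
    DC + "subject": "subject_facet",
}

SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Squirrel skull, dorsal view</dc:title>
  <dc:creator>UW-Madison Zoology Museum</dc:creator>
  <dc:subject>Skeletons</dc:subject>
  <dc:subject>Mammals</dc:subject>
</record>"""


def to_index_doc(xml_text, record_id):
    """Build a flat, multi-valued document ready to hand to an indexer."""
    root = ET.fromstring(xml_text)
    doc = {"id": record_id}
    for element in root.iter():
        field_name = CROSSWALK.get(element.tag)
        if field_name and element.text:
            doc.setdefault(field_name, []).append(element.text.strip())
    return doc


doc = to_index_doc(SAMPLE, "specimen-1")
print(doc)
```

Once the table exists for one flavor of metadata, supporting the next record of that flavor costs nothing, and a new flavor means a new table, not new indexing code.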

So that’s what I have to tell you. If I’ve helped you see some of these problems in a new way, if I’ve expressed them usefully, such that they get solved, perhaps I’ll get to stop being Cassandra—and instead become the Clio of data curation. Here’s hoping. Thank you!