A Successful Failure: Community Requirements Gathering for DSpace

Before I start this talk, a few disclaimers. First, nobody asked me to gather requirements for DSpace. Second, nobody vetted or co-wrote this presentation. It is entirely my own work, and I own any errors in it and any offense it causes. Third, I am a notorious gadfly and crank. I make trouble. Ask anyone!

That said, then, we all understand that this presentation represents my opinion only and does not represent that of my employer or the DSpace Foundation, right? Because if anyone’s going to land in the soup for this, it should be me, okay? Good. Onward.

DSpace is more than a software package; it’s a social phenomenon. And as such, it’s in trouble. Its developer pool is much too small, which has led to a self-reinforcing problem spiral: patches and hacks languish in the queue, repository managers and end-users get more and more annoyed at how slowly DSpace improves, developers try to placate them at the cost of their coding time, and seeing all this, potential developers decide to donate their effort elsewhere.

At the other end of the community, we have a great many silent end-users. Considering DSpace’s market position as the “out-of-the-box solution,” as well as its history as a research project between close collaborators MIT and Hewlett-Packard, helps explain why. People who pick a supposed out-of-the-box solution in the first place are not coders. In fact, they’re not even expecting to have to talk to developers about the product, and they’re less likely to know how to talk to developers than people who are more accustomed to open-source and its discourse traditions. As for DSpace developers, they currently take the very research-computing, classically open-source-developer perspective that “the price of a voice is code.” So end-users are silent; given the system they’re stuck in, what else can they be?

Technically, DSpace is lagging behind competitors at present. EPrints is much more usable and easier to market to faculty; it is also immensely easier to install. Fedora Commons is much more flexible and scalable, with a better, more generalizable data model. As for software-as-a-service providers, BePress’s Digital Commons is eating DSpace’s lunch on ease-of-use grounds. Librarians I’ve talked to who have looked at both and chosen Digital Commons tell me unanimously that DSpace is too staff-intensive to install and maintain in-house, and too inflexible to run consortially.

So combine DSpace’s social problems with its technical problems, and what we see is a wild profusion of hacks. Is anyone in this room running DSpace unmodified except for HTML and CSS changes? (No hands go up.) Right, I’m not either. We’ve got embargo hacks, electronic thesis and dissertation (ETD) hacks, statistics hacks, persistent-bitstream-URL hacks, authentication hacks, researcher-pages hacks, streaming-multimedia hacks… and worst of all, none of this ever makes its way back into the main DSpace codebase!

What is a hack, fundamentally? A hack represents an end-user need that DSpace is not meeting. And while all these end-user needs are not being met, DSpace somehow has time to add controlled-vocabulary support? Is anyone even using that? (One hand goes up, slowly.) Oh, good, I’d hate to think all that effort was a complete waste. But, seriously, who is setting development priorities here? It’s not me or anybody like me. Who is even being heard when those priorities are set? Again, it sure doesn’t feel like me.

So I said to myself, I said, “IR managers don’t feel they have a voice. Let’s give them one! Developers don’t feel that the community supports them. Let’s show different!” In the best case, engaged repository managers can help convince their library administrations to throw more resources (developer time, Foundation support funding) at DSpace. At the same time, potential developers will see a functioning community that they want to participate in.

The plan I then made included synchronous and asynchronous discussion options, because DSpace is global. Moreover, repository managers are accustomed to the mailing lists, not so much to IRC. I designed in an option for private communication, too, because the current atmosphere can feel intimidating. The idea was that there would be a “question of the week,” whose discussion would then be summarized to the wiki for future reference, to avoid the disingenuous “gosh, when did you say you wanted that?!” reactions that repository managers often hear when we ask for something.

The venues turned out to be: the dspace-general and dspace-tech mailing lists for public asynchronous comments; the Meebo chat system for public synchronous comments until it got bot-spammed, at which point we moved to DSpace’s IRC channel; and my email for private asynchronous comments.

Discussions included:

  • Users’ most-wanted changes
  • An item-access statistics system
  • What the “ideal” repository system would act like and be capable of
  • Suggestions for improving the item-deposit interfaces
  • Suggestions for DSpace documentation

DSpace developers were extraordinarily good citizens during this experiment, extremely active and respectful. Brad McLean, Mark Diggory, Tim Donohue, Claudia Jürgen, and others gave the process their attention and were willing to listen. Repository managers… well, that’s where “Houston, we have a problem here.” Despite there being dozens of repository managers for every developer, only a bare handful made any effort to participate at all, though Shane Beers and Christophe Dupriez were heroes. Six questions went out, five-and-a-half discussions were held… and manager participation was so low by the sixth question that it was perfectly clear this effort was a nonviable failure.

Why did it fail? Was I the wrong person to take it on? Very likely! I ran into some time crunches, and I’m also not very popular in the DSpace community (and now you know why). Were email and IRC the wrong venues? Would conference sessions and surveys have done better? Do librarians know how to give good feedback on software? Are the right people on the mailing lists? Do the lists reach local customizers and developers, for example? Or is it something I haven’t thought of? Very likely it is!

Why was it a successful failure? Well, we did surface and document some unmet needs. Those of us repository managers who took the trouble to speak up were heard and heeded. I’m here right now, talking to you frankly and honestly about this. And in the possibly-apocryphal words of Thomas Edison, we’ve learned a way it won’t work.

I want to close by questioning the idea of “a DSpace community” that I often hear people alluding to. Is DSpace one community? Where are the library administrators in this community? And how much faith do institutions have in DSpace’s processes and outcomes? If what DSpace has isn’t a community, what is it? Librarians aren’t used to the open-source “community development” concept to begin with. Might a different model work better? And if there is no community, or if it isn’t powerful enough, will DSpace survive?

I’ve shot my rocket here. The next launch is yours to plan and execute. Thank you.


A Successful Failure: Introduction

I loathe institutional repository software. All of it, without exception. (That it helped destroy my library career doesn’t improve my opinion of it any.) As I write this in February 2015, I have seen nothing to shake my belief that it’s all terrible pointless dancing bearware with all the usability and fitness-for-purpose of the teapot commonly seen on the cover of Donald Norman books.

Because I spent my repository-management career in DSpace shops, DSpace developers caught most of my withering opinions on the subject, though Islandora has gotten an earful once or twice as well. Whether because I am a woman, because I’m obnoxious, because I am a hacky amateur coder/sysadmin at best, or because of standard open-source inattention to user needs, my specific critiques never made much headway.

I did, however, help nudge DSpace into incorporating more repository-manager feedback into its development-prioritization processes. This talk, and the effort underlying it, was how that happened. It was given at a DSpace User Group meeting on November 19, 2008.


Save the Cows! Cyberinfrastructure for the Rest of Us

Good morning, and thank you for coming. My name is Dorothea Salo, and I work for the University of Wisconsin System as an odd sort of digital archivist. I do have strong interests in the area of cyberinfrastructure, as I hope to prove to you today, and so Melissa [Woo] asked me to come here and talk to you a little bit about my angle on the whole cyberinfrastructure thing.

And I promise you will understand the title by the time I’m done talking. Cross my heart.

So, when we say the word cyberinfrastructure, some of the first things that come to mind are grid computing, in which we throw a whole lot of little computers working together at huge, massive computational problems, and data mining, in which we throw those computing resources at huge amounts of data on a scale we could never have considered before.

Of course, these processes create new data. Terabytes and petabytes of it. And now all the librarians listening to me are wincing, because our shock-and-awe sensors tripped as soon as you could fit the Library of Alexandria on a USB thumb drive, you know what I’m saying? And then the grid computing people start tossing around exabytes, and my brain just shuts down.

In the UK, what we call cyberinfrastructure is often called “e-science.” This, of course, betrays an assumption. So we don’t use “e-science” here, because it’s not just the physicists and the astronomers and the climatologists; we say “e-research” instead, because it’s certainly true that the social sciences, the arts, and the humanities are joining the party too. And with that, we add concerns over collaboration, especially across institutions and across disciplines—and doing cross-disciplinary collaboration creates sticky issues around identity and authorization and it all gets very evil and nasty and complicated very quickly.

And while we’re at it, let’s not forget the data I mentioned. An emerging professional specialty, though exactly where it’s emerging is a really good question, is that of data curation. This brings up questions of metadata, a thing dear to librarian hearts that just made the IT professionals here cringe, and data standards. We have a few of those, in a few disciplines, but not nearly enough, and unstandardized, non-uniform data is something that I think we can all agree makes us all cringe!

And then there’s the question of who’s going to do data curation. Is it an IT function? Are faculty responsible? After all, it’s their data! And what about those libraries? And by this time much screaming has ensued and much hair is being torn out.

It’s simpler than that. Thank goodness!

Scholars are using computers, in a number of different form factors, from tiny smartphones to big old server racks, in their research. This, I am sure, is not news to anyone!

All this computation produces data, sometimes as the point of the exercise, sometimes as a sort of side effect. Data takes all kinds of forms; it’s not just numbers. Word-clouds, scanned manuscripts, maps, images on wildly different scales—it’s all bits-and-bytes; it’s all reusable and recomputable—it’s all data!

This is in addition to the books and journals that librarians are familiar with and already care for. But interestingly, as these materials go digital themselves, they too can be treated as data, grist for the computational mill. This doesn’t happen as much as it should, honestly, and the reason for that is that even when these materials are digital, they’re locked up behind pay-access firewalls to protect the current scholarly-publishing business model, so the computers can’t get in to crunch on them. This is a major argument for open access to the literature—and for those of you who know me and what I do, I hereby reassure you that it’s the only open-access argument I’m going to make in this presentation.

So to recap a bit, we have our researchers, and they’re using computers, and they’re generating data. And support for that, librarians, has to happen throughout the entire data lifecycle. And that support, IT professionals, is absolutely not limited to providing computational horsepower and storage. And that support, scholars and researchers, has to include verification and documentation of data-gathering methods, so that everyone knows that everything’s on the level, and it’s got to include ways to refer back to other people’s data that you’ve used; that’s what I mean by ‘certification’ here.

That’s all this is about. Really. And that’s the cyberinfrastructure puzzle as I see it. It’s all about data.

But what is data, exactly? We’re used to thinking of data as nice bar graphs and charts, with a nice key in the corner; you can imagine one on a web page or equally well on a print journal page. This is data, right?

No. Actually no, that’s not data, not data in the sense I mean it. Charts and graphs are dead data, data that’s been killed, cut in pieces, and ground up until it’s unrecognizable, just like hamburger. Data in charts and graphs is not revivable and not reusable. For optimum reusability, we need to save data before it’s distilled into charts and graphs and tables. In other words, we need to save the cows—before they become hamburger!

(In case you’re wondering, I owe the hamburger-and-cow image to XML expert Michael Kay, who once famously said “Converting PDF to XML is a bit like converting hamburgers into cows.”)

So in tight budget times, a very good question to ask is whether it’s actually necessary to solve this problem. Even if it is, do we have to solve it now? Do we have to keep all these data?

The answer is a resounding—sometimes. But I do want to add that even when it’s not absolutely required, it’s often a really good idea. On the Madison campus, we have collected a number of stories of researchers who wish they’d done a better job keeping their data, because a new use turned up for it, often years or decades later!

So in what cases is it mandatory? Funders may require it, as the National Institutes of Health (NIH) sometimes does. Just to be clear, that’s completely separate from the Public Access Policy requiring open access to journal articles published with NIH funds. Journals may require it. Most of the funders requiring open data are in Europe at the moment, but that’s not true of journals. I can’t give you a laundry list, because it’s very discipline-dependent and also very volatile, but we are seeing more and more science journals instituting data-retention policies.

Now, the data-retention policies I’ve seen have usually been time-limited; five or ten years is common. My question is this: if you’re going to do it for five or ten years, why not plan for longer? Sure, it makes sense to assess every now and again, because some datasets do become obsolete. But don’t let your thinking be governed by journal requirements; most of the work of keeping a dataset happens before the bits hit storage, so keeping them longer adds very little marginal cost.

Here’s the catch. Some of these data stakeholders have built barns for the cows. Many haven’t. And guess who’s on the hook if they don’t? There’s nothing stopping a journal or a funder from creating an unfunded mandate to keep and preserve data. A few have. And we, collectively, researchers and librarians and IT professionals, are left dangling on the hook figuring out how to comply.

So that’s the stick. Now for the carrot. We’re keeping all these data. Why? What’s the use? What can be done with data?

  • Experimental validation
  • Meta-analysis, data-mining, mashups
  • Interdisciplinary investigation
  • Historical investigation
  • Modeling and model validation

… the possibilities are endless—IF we have the cows—that is, the data.

Is all data from “big science”? I’ve answered this already, for those who were listening at the beginning, but for anybody who came late, and just to reiterate, there’s an image of cyberinfrastructure that assumes it’s all about the Higgs bosons of this world. Physics, astronomy, and biomedicine. That’s who’s got all the data, just like they’ve got all the money.

Absolutely not. And they don’t even need our sloppy help.

A broader concern is so-called “small science,” science without the big bucks, which is frankly most scientists, not that that surprises anyone. The big guns have mostly worked out their data issues, as I’ve said. The small-science folks—a lot of them hardly seem to know where to begin.

And the sting in the tail here is that there are a lot more small-science researchers than big-science researchers. This means that if you pile up all their data, there’s probably a lot more of it! Each individual data-herd is pretty small by comparison with the Large Hadron Collider, granted. But add all those herds together, and we are talking a lot of cows.

And my dearest loves, the arts and humanities, are hardly devoid of data. A digitized image is data. A digitized book is data, and can be computed upon. The performing arts are pushing out huge amounts of audio and video—and while we’re talking storage capacity, digital video is an unbelievable headache because of file sizes. I like to think about folklorists and ethnographers while I consider digital data in the arts and humanities. Anything you can imagine is grist for their analysis mill, and yes, they are both analyzing digital data and recording their conclusions digitally.

So we’ve all got data, one way or another. And here’s the catch with that: we don’t have a service-provision model for this. Not in libraries. Not in IT. Not in most regular research practice. Nobody’s sure how it’s going to get done yet. This is part of why I’m here today. UW-Milwaukee is busily trying to sort out how to do all this.

But we do know a few things…

Cows are dumb. They will not save themselves. We know that apathy is not a solution to data management. And here we often hear someone grumbling that if this was just all paper, it’d be fine; it’s this stupid digital stuff that’s the problem. Leaving aside that data on paper are completely useless as data, we shouldn’t ignore the incredibly complex safety net that libraries have built around paper. Paper doesn’t preserve itself either; librarians preserve it! Digital data are no different. We have to take intentional action to keep data viable.

It takes a village to save the cows. Right, so who’s we? Let’s have a show of hands around the room. Librarians? IT pros? Faculty and researchers? Research support, grant administrators and the like? Right. If you raised your hand at any point, part of this is probably your problem. Which part, I don’t know, and anybody who tells you they know is lying and probably trying to sell you something.

So, can you tell a Holstein from an Angus? (I’m just going to die if there’s a dairy researcher in the room.) No, I can’t either. But researchers know their cows! The point of this little parable is that we know absolutely that data curation can’t happen without researchers helping and being cooperative with other people in the village. This is because data without context and interpretation are meaningless, like a spreadsheet with the header row chopped off. Librarians and IT pros don’t automatically understand how a given dataset fits together, how it was created, how other people will expect to search for it or use it, what different parts of it even mean. Researchers will have to learn to express these things, if they don’t already know how.

IT pros, you’re going to be running the big iron, no surprises there. But there are surprises for you in this, such as time horizons you’re not used to, mass file format migrations, metadata internal and external and relational that we can hardly imagine yet… and so on. Don’t panic, we’re all in this together, and we have examples to work from, especially on the larger scales—but by the same token, don’t make the mistake of thinking you can just sail in and solve this one. It’s complicated.

Librarians, this is your call to arms. Step up and sit at the table, or the table is going to forget that we exist. This isn’t good for the table, and it’s not good for us, either. Sure, we’re used to dealing with the published literature, and we’re fond of its authority and finality. But we’re going to have to look earlier in the lifecycle for our greatest impact:

But what I see happening is … this beautiful combination of understanding the structure of information, and understanding the code that goes behind it, and how to make it usable to the people who want to access it. I think that we used to talk about blended, or the hybrid librarian — now that’s the librarian.

“Librarian 15”
Palmer et al., “Identifying Factors of Success in CIC Institutional Repository Development”

Grant administrators: cows don’t corral themselves. Neither do researchers. We need you.

And then there’s the big gray area. When I said I didn’t know who would do all this? This is what I meant. Some researchers say that the solution to data-management deficits is to teach themselves—or up-and-coming newcomers—information-management skills so that they become informaticists. Some researchers say that the answer is for researchers to learn to code. All of this will probably happen, in some fields and at some levels. I don’t know how it will all shake out, in the long run. But cross-functional training, no matter what end of the research enterprise you’re on, is probably the wave of the future.

Great. So now what?

  • Find use cases. Find the people with data problems needing solutions. I guarantee they exist.
  • Plan for infrastructure. Data infrastructure is more than computers, let’s not forget. It’s also a policy and procedures infrastructure, without which none of this can happen. And finally, as I dearly hope I’ve made clear, infrastructure is people. Fancy supercomputers aren’t worth a penny without people to use them, care for them, and take care of what they compute.
  • Build alliances. No one can do this alone.
  • Keep an eye out for opportunities.
  • Start conversations. Everyone in this room can do this, and I hope you will.

But, you may ask, what do you say to someone about their data? I recommend starting with Michael Witt and Jake Carlson’s Ten Questions:

  1. What is the story of your data?
  2. What form and format are the data in?
  3. What is the expected lifecycle of your data?
  4. How could your data be used, reused, and repurposed?
  5. How large is your dataset, and what is its rate of growth?
  6. Who are the potential audiences for your data?
  7. Who owns the data?
  8. Does the dataset include any sensitive information?
  9. What publications or discoveries have resulted from the data?
  10. How should the data be made accessible?

If this seems like common sense… good! It mostly is! Thank you—and save a cow today!


My Repository Rant

(unaffiliated with and not accepted or endorsed by Open Repositories 2014 or anyone or anything else for that matter, certainly not any employer I have ever had)

Hello. My name is Troy McClure—actually, my name is not Troy McClure. My name is Dorothea Salo. You may know me from such pigheaded rants as “Innkeeper at the Roach Motel” and “How to Scuttle a Scholarly Communication Initiative.”

I’ve been around the block a few times with repositories. As a result, I burned out on them completely, so now I’m teaching in a library school. And you know what they say about library-school instructors, right? I mean, we all know. Of course we do.

Those who can, do… but library-school instructors suck, am I right?

Well, I think that’s actually a fair cop. I can’t do repositories any more. Not “won’t” but “can’t”; my health won’t permit it. When I burn out, you see, I don’t do it halfway. But I was no damned good at repositories anyway, so no loss.

So those who can, do… but library-school instructors suck, am I right? It’s all our fault you can’t hire, or so they say. That one is also a fair cop. It is one hundred percent my fault when you can’t find new-graduate repository managers. Why is that, then? Let me tell you about my students and me—and my students are your applicant pools, so don’t tune out, all right?

I am a mama bear about my students. I do not want them burning out the way I did. And since I’ve been around the block a few times, I am onto you. Some of you, anyway. You can read what I know about you in “How to Scuttle a Scholarly Communication Initiative.” How you’ll title a job “Coordinator” when there isn’t one damned thing to coordinate—no budget, no infrastructure, no support (developer time? direct reports? don’t make me laugh), and no local expertise except for the hiree’s. How throughout the entire library, only the hiree has any real skin in the game.

How people bounce right off the software the hiree will be forced to use. I’ve had plenty of developers blow me off about usability, and they can just cut out the devsplaining right now. It’d take a deity to run a popular, well-used repository off this dancing bearware of theirs. I have the screenshots to prove it. I’ve taught the workshops, too. I wish repository developers would realize that usability critiques are contributions to the community!

Yes, I’ve been around the block a few times. So here is my rant—wait, you thought I was ranting already? Goodness, no. I haven’t even started ranting yet. So let me start:

  • Stop writing job ads for purple unicorns. And stop blaming me for not producing them. I am not even listening to that nonsense any more.
  • My students are not your superheroes. Stop expecting them to save you.
  • My students are not your staff’s dumping ground for work they fear or don’t want.
  • My students are not your staff’s dumping ground for resentment and discontent with respect to new aspects of librarianship, either.
  • It is not my students’ job to cover for your staff’s inability or unwillingness to reskill or step up.
  • My students should not need to evangelize the rest of staff just to be able to run the repository properly—much less evangelize their own reporting chains! That groundwork should be laid for them, long before they arrive.

The bottom line is, if you and your staff cannot properly support my students, you and your staff do not deserve my students.

Now, I’ve been around the block a few times. I know all your dodges. I can spot them in job ads. And because I am a mama bear about my students, I tell them, “You know, some jobs you shouldn’t take. Oh, look, there’s one right there!” And when you can’t hire a purple unicorn because my students know better than to apply to your no-win jobs? It is one hundred percent my fault. And I am one hundred percent at peace with that.

I’m done ranting now. Aren’t you glad? I am. I don’t like ranting. It just seems like the only way I can get through to people sometimes.

Now, I do want your repository to succeed. I want your library to succeed. I want open access to succeed! But most of all, I want my students to succeed. It’s my job to want that most; I work for them now, not you. So I am begging you, please read “How to Scuttle a Scholarly Communication Initiative” and take it to heart. It’s open access; there’s absolutely no excuse. Do the exact opposite of everything it says. Because burning out more repository managers isn’t good for anybody.

Thank you.


My Repository Rant: Introduction

In 2014, the Open Repositories conference added a “Repository Rants” lightning-talk segment. I didn’t go, but I felt a rant bubbling up about the unwarranted entitlement of academic libraries as employers, and how as an educator I flatly refuse to feed it. So I hacked up a quick slidedeck, recorded narration over it, posted it password-protected on Vimeo, and took it down immediately after the conference.

This is that rant. Nothing about it should be laid at Open Repositories’ doorstep.