We Aim to Misbehave! A Graduation Address

First and foremost, congratulations, new graduates! I know how hard you’ve worked to get here; eight short years ago I was one of you myself. I honor the sacrifices many of you and your family members and friends have made so that you would reach this day. I am proud to know you, and proud that you are all my colleagues now.

I’m told that the purpose of these graduation speeches, beyond congratulating you, is to tell you some more-or-less quotable truths about the professions you’re on the threshold of. So I’ll try to do that. The thing is, I can’t do it in this academic garb; it represents where you’ve been, not where you’re going. I need… a browncoat. Anjali, can you find me a browncoat?


For those of you who don’t know, the browncoat comes from Joss Whedon’s Firefly television series, where it’s worn by a group of rebels who fought and lost a bitter war against an untrustworthy and oppressive government. The series’ browncoat-in-chief is the character Malcolm Reynolds, who at a crucial point in the sequel movie Serenity declares, “I aim to misbehave.” And that’s what I have to tell you about all of us, every information professional worthy of the name: we aim to misbehave.

And now all your family members and friends who thought you were going into nice, safe, decorous professions are staring at me in utter horror. So let me try to explain.

Librarians and archivists and other information professionals don’t aim to misbehave just because we’re argumentative and contrary people. We’re not. I, for example…


See, they know me. I am an argumentative and contrary person, and what this turned out to mean for my career as an academic librarian is that I got told, “march yourself over the librarianship county line by sundown and don’t you ever come back!” Which just goes to show, right? We don’t misbehave for the sake of it.

But sometimes… sometimes what’s happening in the world around us is just so egregiously bad that there’s no choice but to misbehave. What’s happening to information is no exception—access to information, access to information carriers and information tools, the ability to find and use and reuse and remix information without being arrested or sued, the privacy of certain kinds of information, getting the right information into the right hands—there is a lot of controversy and struggle and real pain and loss around all this, and it shows no signs of stopping any time soon! Graduates, I want you to remember always that the professions you are entering today have a long history of misbehaving, openly and proudly, when certain lines are crossed.

Now, this surprises a lot of people! Left-wing author and documentarian Michael Moore once said of librarians, “You think they’re just sitting there at the desk, all quiet and everything. They’re like plotting the revolution, man! I wouldn’t mess with them!” He said this after his publisher threatened to destroy every last copy of his already-written, already-printed book unless he rewrote it to be less critical of the United States president then in office. Who made that publisher stand down and release the book uncensored? Librarians, starting with just one New Jersey public librarian: Ann Sparanese. Ann Sparanese aimed to misbehave.

When the United States Patriot Act threatened the reading privacy of every patron in every public library in every city and town in every state across this country, librarians aimed to misbehave. In response to the piece of the law that said that library patrons couldn’t even be informed that the FBI had come to the library to snoop into patron reading histories or investigate library computers, Vermont librarian Jessamyn West designed “technically legal” signs that said, “The FBI has not been here! Watch very closely for the removal of this sign.” Four Connecticut librarians, Barbara Bailey, Peter Chase, George Christian, and Janet Nocek, fought a National Security Letter and associated gag order all the way up the chain. And when he was asked about that on national radio, George Christian pointed to the Library Bill of Rights as his guiding light. When was the Library Bill of Rights originally adopted by the American Library Association? In 1939! We’ve been aiming to misbehave for a very long time.

Coming back to academe, I’d be remiss if I didn’t bring up Aaron Swartz, who took his own life earlier this year after federal law enforcement threatened him with a felony conviction and ruinous legal fees for the alleged crime of downloading a whole lot of scholarly articles. His death is the horrible, wretched, tragic culmination of a long-standing information injustice: publishers don’t pay scholars and researchers to write down and peer-review their knowledge, but a good many publishers do insist that libraries pay outrageous, unsustainable sums of money to let those very same scholars and researchers read it—never mind anybody else! Well, some academic libraries have had just about enough of that. SLIS alumna Jenica Rogers, who is Director of Libraries at SUNY-Potsdam, aimed to misbehave last year when one publisher, one, tried to soak up ten percent of her library’s entire acquisitions budget, for one journal package for one department. Ten percent! So Jenica told that department what was going on, and she made alternate arrangements with them, and then she cancelled that supposedly-irreplaceable journal package subscription with a quickness. And what’s more, Jenica took this story public in the teeth of that powerful publisher’s anger. I know Jenica Rogers. I know she aimed to misbehave.

Archivists, think you’re safe in the arms of the past? Think again! Boston College archivists arranged to have members of the Irish Republican Army tell their stories. In return for these people’s painful honesty, the archivists promised them that their recordings wouldn’t be shared until after they were safely dead and beyond the reach of prosecution. Well, the United Kingdom government got wind of these recordings, and demanded that the archivists break their promise and turn the recordings over. All right, you tell me, what did those archivists do?


They aimed to misbehave, that’s right! Those archivists fought all the way to the Supreme Court to keep the promise they made to the people who trusted them enough to tell their truth for posterity.

Now, graduates, I want you to remember that misbehaving comes in all sizes. I’ve been talking about great big misbehaviors, but depending on what the policies are like where you work, giving a homeless person a library card, waiving a fine for someone who just lost a job, letting a frazzled undergraduate drink a cup of coffee in peace, criticizing a publisher on a blog, digitizing some kinds of information, refusing to exploit other kinds of information even when big online services make it seem all right—any of these actions could be misbehaviors. They could also mean the world to someone. So you don’t have to aim to misbehave big, just aim to misbehave right!

There are differences between the kind of misbehavior I’m talking about and a toddler’s kicking and screaming misbehavior. Obviously one of the differences is conscious, deliberate action, but knowledge is the other. I hate to tell you, after all your hard work, but you have not finished learning about information and its place in the world today—you have barely started! Don’t ever stop adding to your knowledge. Don’t ever stop changing your behaviors and your misbehaviors based on what you learn. Graduates, I want you to go out there into the world and be curious, even when some people would rather you didn’t. Ask questions, even the awkward questions, especially the awkward questions. Read, listen, watch, code, build, analyze, learn, and understand. And then aim to misbehave!

Last thing. Misbehaving takes incredible courage. I know you have courage; you’re sitting here about to graduate! But your individual courage will be all the stronger for knowing there are brave and smart people around you who have your back. With [DISTINGUISHED ALUMNA SPEAKER] Dr. [Toni] Samek, right here and now I promise you, new colleagues, that I got your back the day that you consciously, deliberately aim to misbehave. And I hope you will do the same for me, because I will surely need it.


And I tell you that most of the time—not all the time, we’re not perfect; but most of the time—your whole profession has your back when you aim to misbehave. But right now, since the whole profession is not in this auditorium, I’ll just ask you, graduates, to look to the people to either side of you, and shake their hands, and tell them, “I got your back.” Just that. Go ahead, do it.


It felt good to hear that, yeah? Didn’t it? Remember that feeling. Wherever you go, whatever you do with your careers and your lives, give your colleagues your word that you got their back when they aim to misbehave, and then honor your word. It means more than I can tell you.

I don’t know where you’re all going once you leave here today. I don’t know what will happen to you, though of course I hope for the best. I do know some of the things you will do. You will change the information professions for the better. You will change human lives for the better. You will change the world for the better. And I also know that sometimes that will hurt, and sometimes it will be infuriatingly slow and bureaucratic, and sometimes it will be scary as all hell, and sometimes just like the browncoats in Firefly you’ll have to lose a lot in order to gain a little, and sometimes you won’t even know what you’ve gotten yourselves into, much less why.

It doesn’t matter. Do it anyway. Help people anyway. Change the world anyway! Let me borrow some words Joss Whedon gives to Captain Malcolm Reynolds:


“Y’all got on this boat for different reasons, but y’all come to the same place. So now I’m asking more of you than I have before, maybe all… Sure as I know anything, I know this—they will try again… And I do not hold to that. So no more runnin’. [We] aim to misbehave.”

Thank you, and congratulations once again, Class of 2013!


We Aim to Misbehave: Introduction

In the introduction to this collection, I promised inclusion of one talk I’m ashamed of. This is that talk, the 2013 faculty commencement speech for the School of Library and Information Studies at the University of Wisconsin at Madison.

Because of the way I phrased my introduction to the Boston College archives situation, I gave the impression that our archives students and graduates expected safety from their work. That is wrong, and the students who felt insulted by it were entirely justified. I am sorry, and sorrier still to have insulted them on a day they should have been celebrating their accomplishments and looking forward to their futures.

The smallest rhetorical flourishes sometimes make the biggest messes. I needed to change only a few words in that introduction to make clear my true belief, parallel to my expressed belief about libraries: that many people outside archives wrongly think archives jobs are safe and uncontroversial. But I didn’t. I will not forget the lesson, and I hope it helps others avoid my mistake.

This talk was also lambasted publicly for its focus on heroism and its allegedly too-obvious choice of library and archives heroes to feature. I do not apologize for this. I in no way think it was wrong. I deliberately chose this focus to explain to our graduates’ family and friends in attendance why they should be proud of our graduates and their accomplishments.

If my non-librarian family and friends are any indication, many people sitting in that audience were still mired in stereotypes about information work and the people who do it—indeed, a few of our guests came up to me at the reception to tell me my talk had completely revised their mental image of librarians and archivists! The “usual suspect” heroes any modern librarian or archivist knows about, the ethical commitments and advocacy work that good professionals take for granted, were completely new to these people.

I am not and will never be sorry to have taught our graduates’ well-wishers something. I am and will always be sorry I insulted some of our graduates.

Curiously, no one (that I know of) argued with the humorous frame I chose. In case anyone is curious, however, SLIS faculty commencement speeches have a long and honorable history of gimmicks and jokes. The year I graduated from SLIS, Dr. Edwin Cortez intentionally spilled a glass of water all over his commencement regalia. Another year, Dr. Stephen Paling knighted the entire SLIS graduating class with a toy Jedi saber.

Chances are I will someday be asked to do another faculty commencement speech. I commit absolutely to doing better then than I did in 2013… but I do think I will keep the humor.


Soylent Semantic Web Is People!

So hello, guten Morgen. My name is Dorothea Salo, and I teach many nerdy things, linked data among them, at the School of Library and Information Studies at the University of Wisconsin at Madison. I first want to say vielen Dank—thank you very much—for inviting me here, and I hope I can kick off this conference in a fun and useful way.

It’s the year 2022… People are still the same.
They’ll do anything to get what they need.
And they need SOYLENT GREEN.

Soylent Green movie poster

In 1973 Charlton Heston starred in a science-fiction movie called Soylent Green. And it’s a terrible movie, talky and preachy and weirdly acted and often just ridiculous, so I don’t feel too bad about spoiling the big plot twist. In the movie, the environment has degraded so badly that food can’t be grown, so what everybody eats is artificial foods called Soylent Whatever—Soylent Red, Soylent Yellow, and the brand-new Soylent Green. What they don’t know, until Charlton Heston yells it at the end of the movie, is that Soylent Green Is People! More specifically, Soylent Green is what happens when you make people into food. Ew. But the total nastiness of cannibalism aside, what’s interesting about this movie is that you have this whole society that has absolutely no idea that it’s completely dependent on people for its survival!

It’s the year 2013… Data are still the same.
We’ll do anything to make sense of them.
And for that we need PEOPLE.

Now, we’re not cannibals here in Hamburg; we don’t actually eat people. The parallel I want to draw here is that the original Semantic Web vision curiously lacked people, except maybe as the end-user beneficiaries of linked data. I mean, you can go back and look at what Berners-Lee and his cronies wrote, and you have all these people booking travel and getting health care or whatever because of all the nice clean shiny RDF data whizzing around in nice clean shiny server rooms, sure. But the data whizzes around all by itself. Doesn’t need people. There are no people. Just data.

I just think this is a counterproductive, even dangerous, way to frame the Semantic Web—and it’s still much too common. So I assert that the Soylent Semantic Web Is People! Because I want a human semantic web. A humane semantic web. Technology without people is just dead metal and silicon. Data without people is just noise.

It’s the year 2013… Data are still the same.
We’ll do anything to make sense of them.
And for that we need LIBRARIANS.

And more, since we’re here at Semantic Web in Libraries, I will assert that Soylent Semantic Web Is Librarians! We are the Semantic Web, and the Semantic Web is us! I know that isn’t completely news—we invented SKOS, we invented Dublin Core, we have Karen Coyle and Diane Hillmann and Ed Summers, just for starters—but if you had to ask me why this specific conference is important? That’s what I’d say. The Soylent Semantic Web Is Librarians.

What I want to do today is explain my thoughts about why the Semantic Web is not soylent, not made of librarians, not made of people. I want to explain why it should be soylent. And I want to challenge you in specific ways to make it soylent. My ultimate goal, which I imagine you share, is strengthening library adoption of linked data.

Let’s decide, in approved RDF-triple style, just what properties we can assert about librarians and linked data. And the usual properties I would expect people at this conference to suggest would be the technical ones. Librarians model linked data. Librarians crosswalk to linked data. Maybe as simple as librarians make linked data. Librarians host linked data. Librarians archive linked data. Librarians build systems for, and around, linked data.
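That list of properties can itself be written down in approved RDF-triple style. Here’s a toy sketch in plain Python, with no real RDF vocabulary or tooling behind it; the strings are just the assertions above, shaped as subject–predicate–object triples:

```python
# The "librarian properties" above as subject-predicate-object triples,
# the basic shape of every RDF statement. These names are illustrative,
# not drawn from any real vocabulary.
triples = [
    ("librarians", "model", "linked data"),
    ("librarians", "crosswalk to", "linked data"),
    ("librarians", "make", "linked data"),
    ("librarians", "host", "linked data"),
    ("librarians", "archive", "linked data"),
    ("librarians", "build systems for", "linked data"),
]

# Every statement shares the same subject and object; only the predicate
# varies -- which is exactly the point of the list above.
predicates = [p for (s, p, o) in triples]
print(predicates)
```

Notice that all six statements are about the data; the people appear only as the grammatical subject.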

But none of those properties really belong to the Soylent Semantic Web, the Semantic Web made of people. These properties are about the data, not the people.

Here are some things librarians do, as people, in the Soylent Semantic Web. We investigate linked data. We discuss linked data, sometimes not as knowledgeably as linked-data advocates might like. We learn about linked data. We teach about linked data. We advocate for linked data. Or don’t. And now we get to the crucial point: we adopt linked data.

Or we don’t. We mostly don’t adopt linked data, in fact, and we don’t because the Semantic Web community, librarians included, hasn’t acknowledged that it needs to be soylent. We forget that the Semantic Web is made of people, lots of different kinds of people, some of them people who are not like us and do not do the same work we do and do not have the same understandings we have. We forget that we need our own librarian colleagues to help us make the Semantic Web, and put library data into it—and when we forget our librarian colleagues, our librarian colleagues forget us, and forget linked data. And that’s not good.

As I talk to librarians about linked data, what I hear back is that they feel ground up into hamburger—sorry, sorry, I had to—by the whole thing, because the way it’s usually explained to them, it’s so abstract and so divorced from the actual library work they know. The linked data movement can show them graphs, but it can’t show them interfaces for doing their work. It can tell them about triples, but it’s not telling them how the catalog will work if their Internet connection fails. It can explain ontologies, but not how they’ll navigate them.

After one explanatory talk I gave, I had one cataloger tell me with immense frustration, “I just don’t see how this will work!” And I didn’t have a good answer for her, because I don’t see that either.

This has happened before.

Now, switching away from Soylent Green briefly to—anybody recognize this? I took it from the remade Battlestar Galactica television series, which uses it as a catchphrase. This is not the first time an upstart technology has tried to upend an entire established infrastructure, along with the people using it.

At the turn of the century, I was working in publishing: specifically, electronic publishing, and even more specifically, ebooks and electronic journals. While some of the big journal publishers climbed onto the XML bandwagon, many other journals didn’t, and the trade publishing industry just never did. I remember sitting in an ebook conference next to a high-level editor from a Big New York Publisher, and we were listening to a fairly basic, fairly standard introduction to XML, and I heard her sigh “This is just not my world any more.” She felt alienated. She felt alien. Is there anybody in this room who hasn’t heard a colleague express that alienation?

Even worse, XML didn’t make publishers’ lives easier—it made them harder! Editing, typesetting, indexing, all these workflows got hugely more complicated for what looked at the time like super-dubious returns. And the XML community took no notice whatever of their difficulties, the difficulties actual people were having doing actual publishing work with XML. Why? Because the XML community was having way too much fun loudly proclaiming XML’s superiority over everything ever, and going off into corners to have arcane technical arguments about XML namespaces. Not very soylent! Not humane! Not made of people!

Now, publishers did still make some XML, I grant you. I saw a lot of it. Forgive my language, but trade publisher XML was crap. It was garbage. You wouldn’t feed it to your pet Cylon, it was so bad! Which goes to show that technology that doesn’t fit into real people’s environments won’t be used properly, if it’s used at all.

How many of you knew a slide about institutional repositories was coming? Go ahead, raise your hands. If you know me, you know that I am just so sad and angry about institutional repositories. In Europe, I know, it hasn’t been quite so bad, but in the States, it’s been wretched. But it was the same thing again. There was this technology that was going to make everything better, only the people making the technology forgot all about the people who were supposedly going to use it. So we got these lousy unusable unfixable systems that did lousy useless things, and no big surprise, nobody willingly put anything in them! Because they weren’t soylent! They weren’t made of people!

Incidentally, what happened to the people running institutional repositories? People like me? Well, we got blamed. And I, for one, got out. I will never work on an institutional repository again. This is a thing that happens when systems don’t treat worker-people well. Worker-people abandon those systems, even people who truly believed in them and had high hopes for them. So when we linked-data folks lose catalogers, I think it’s a serious problem.

This will happen again.

Battlestar Galactica

We have plenty of history of technologies not succeeding because they aren’t people-conscious enough. This will happen again to linked data, if we’re not careful, and if the Semantic Web doesn’t remember that it’s soylent—made of people. I don’t want that. You don’t want that. But that’s what’s going to happen if we can’t bring more people to linked data.

It’s the year 2013… RDF is still the same.
Why do people who should know better still believe
RDF is based on XML?

Just as an example, I was at ASIS&T a couple of weeks ago, the big annual conference for the Association for Information Science and Technology. And I went to a session on linked data—and I won’t be any more clear than that, because I’m not here to embarrass any specific person—and I saw this on a slide: “RDF is built from XML.” This kind of thing makes me think that eating people alive might actually be an interesting lifestyle choice! Maybe you too? Because my gosh, it’s 2013, RDF never was built from XML, so why on earth do people who really should know better still believe this strongly enough to put it on a presentation slide?!
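To make the point concretely: RDF is an abstract data model, triples, and RDF/XML is just one of several ways to write a triple down. Here’s a hedged little sketch, plain Python strings and no RDF library, showing the same single triple in two serializations (the subject URI is a made-up example; dcterms:title is a real Dublin Core property):

```python
# One triple, two syntaxes. First, N-Triples: no XML anywhere in sight.
ntriples = (
    '<http://example.org/book> '
    '<http://purl.org/dc/terms/title> '
    '"Soylent Green" .'
)

# Second, RDF/XML: the *same* triple, wrapped in XML packaging.
rdf_xml = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/book">
    <dcterms:title>Soylent Green</dcterms:title>
  </rdf:Description>
</rdf:RDF>"""

# XML is optional packaging for RDF, not the model itself.
```

If RDF were “built from XML,” the first serialization couldn’t exist.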

Clearly education, even really basic education, is a problem here. It’s a people problem, not a data problem, and as an educator, it’s my problem, right? I think of education as my major role in furthering the adoption of linked data in libraries: educating future librarians and archivists and other information professionals, and educating current ones, which I also do. I have to tell you, though, that current linked data infrastructure is not making this easy for me.

Give me 45 minutes, and I can drag a roomful of complete HTML novices through making an extremely basic web page. I know this because I’ve done it! Give me another 45 minutes, and I can drag those same people through the basics of CSS. Again, I know this because I’ve done it. No, they won’t be web designers after that, but they can go and practice usefully on their own and get better, and there’s a ton of resources on the web to help them. XML is a bit harder to explain and work with, but if my roomful of people is actually a roomful of librarians or library-school students, I can drag them through being able to make a basic MODS record in two and a half hours or so. I know this. I’ve done it.

Here’s the thing. I don’t know how much time it takes to drag a roomful of novices through minimal RDF competence. I’m not even sure what minimal RDF competence looks like! So essentially it might as well be infinite time. I’ve tried, I really have. I just don’t think I’ve succeeded. What are the problems I’m running into?

Part of my problem is that the training materials I have to work with force my librarian learners into stunts like trying to catch a ball while jumping off a diving board. Really, a lot of the stuff that’s out there, even I bounce right off of—and I supposedly know RDF well enough to keynote a semantic-web conference!

Take this linked-data introduction from Cambridge Semantics—and in fairness to them, they didn’t make this for librarians, but it’s still one of the best things out there. But look at it: just the first sentence and we’ve already brought in HTTP and TCP/IP without defining them, much less explaining why they’re important in this context. My learners? My librarians and library-school students? They don’t know about the alphabet-soup plumbing of the Internet. They might have heard HTTP and TCP/IP mentioned (quite likely by me, in another class), but that doesn’t mean they know. They’re just going to bounce right off this, or get distracted by something that’s actually a pretty minor and useless detail.

It gets worse. What’s the metaphor this introduction picked out, to explain linked data? The relational database, speaking of things a lot of my learners don’t know about! So this extremely well-intentioned and well-written tutorial is useless to me. It won’t help the people I have to teach, so it’s not soylent.

The answer to this dilemma is not to call my learners stupid. I warn you, I am not even going to listen to that, so don’t anybody try it. I’m also not going to listen to any suggestion that librarians can’t learn about linked data until they learn TCP/IP and HTTP and relational databases and XML and at least three programming languages. That’s ridiculous. I’ve been teaching tech to future librarians since 2007, and trust me, with most things you can meet them where they are—which can, yes, be a really low skill level—and still teach them a lot.

How does that work? The answer—the soylent answer, the answer that acknowledges my learners’ humanity and their love for what they do—the answer is respect, primarily respect for librarians’ existing knowledge base. This is the principle I try to build my lessons on: draw from what my learners already know. I try to teach linked data based on my learners’ interest in it. No surprise, for most of them, their interest has a lot to do with linked data replacing MARC. The rest of them are digital librarians and archivists, or aspiring digital librarians at any rate, and for them I keep library metadata practices in mind.

So, for the sake of time, let’s just stick to MARC. What happens when I try to translate MARC skills and practices into a linked-data context? What happens is the same thing that happened with publishers and XML—I crash my little linked-data car right into all the work that libraries now do, all the work that forms the foundations of library data, that is just impossible to even demonstrate with linked data.

I won’t tell you all my tales of woe—I have a lot of them!—but here’s one. I teach this continuing-education course that introduces XML and linked data to working librarians. This fall I wanted to add a couple of weeks on Open Refine to it, because I thought that data cleanup was important to teach. And I thought that reconciling some random spreadsheet metadata with existing linked datastores would be a cool demo, with pretty obvious relevance to real-world librarian work.

So naturally I thought about name authority control, because it’s just so basic to what librarians do, and because it’s something the rest of the linked-data world is learning to do from libraries. Even in the States, where we’re kind of behind Europe in linked-data experimentation, we have these great name authority linked-datastores, VIAF and the Library of Congress, so I thought a little reconciliation would be easy.
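The demo I had in mind is conceptually simple. Here’s a toy sketch of name reconciliation stripped to its essence, matching messy spreadsheet name strings against a tiny authority list by normalized comparison; the identifiers and the matching rule are illustrative, and real reconciliation against VIAF or id.loc.gov is much fuzzier and smarter than this:

```python
# A tiny made-up authority list: normalized name -> identifier.
# The identifiers here are illustrative, not guaranteed-current LCCNs.
authority = {
    "twain, mark": "n79021164",
    "dickens, charles": "n78087921",
}

def normalize(name):
    # Lowercase, drop date segments, strip whitespace, so that
    # "Twain, Mark, 1835-1910" and "twain, mark" reconcile alike.
    parts = [p.strip() for p in name.lower().split(",")]
    keep = [p for p in parts if p and not p[0].isdigit()]
    return ", ".join(keep)

def reconcile(name):
    """Return the authority identifier for a messy name, or None."""
    return authority.get(normalize(name))

print(reconcile("Twain, Mark, 1835-1910"))  # n79021164
print(reconcile("DICKENS, Charles"))        # n78087921
print(reconcile("Heston, Charlton"))        # None (no match)
```

Ten minutes of demo, obvious relevance to cataloging work. That was the plan, anyway.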

I learned very quickly, of course, that I can’t use VIAF from Open Refine, because there’s no SPARQL endpoint for it. I’m on the record here, so I’ll just say—you tell me why not. So I said to myself, okay, that doesn’t work, what about the Library of Congress? Naturally I went right to the source, Ed Summers, because who wouldn’t?

Oops. Ed told me I can’t do authority-control reconciliation that way either; I’d have to download the entire dataset and load it into a triplestore offering SPARQL-endpoint capability. This is where I confess the limits of my own knowledge: I don’t know how to build a web-available triplestore with a SPARQL endpoint off somebody else’s data! And this lesson I was working on was two weeks from going live—I didn’t have time to figure it out!
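For anyone wondering what “load it into a triplestore offering SPARQL-endpoint capability” even means conceptually: load somebody’s triples into a store you can run pattern queries against. Here’s a minimal toy sketch of that idea in pure Python — not SPARQL, not a real triplestore, just pattern matching with None as a wildcard, over made-up authority data:

```python
# A toy "triplestore": a list of (subject, predicate, object) triples.
# Identifiers and labels below are illustrative authority-style data.
store = [
    ("lc:n79021164", "skos:prefLabel", "Twain, Mark, 1835-1910"),
    ("lc:n79021164", "skos:altLabel", "Clemens, Samuel Langhorne"),
    ("lc:n78095332", "skos:prefLabel", "Dickens, Charles, 1812-1870"),
]

def query(s=None, p=None, o=None):
    """Return all triples matching the pattern; None matches anything.
    This is the conceptual core of what a SPARQL endpoint does."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Which authority record carries this alternate label?"
hits = query(p="skos:altLabel", o="Clemens, Samuel Langhorne")
print(hits[0][0])  # lc:n79021164
```

The concept fits on one slide. Standing up an actual web-accessible endpoint over a full downloaded dataset, in two weeks, does not.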

I asked Twitter if anybody else had maybe done authority control with Open Refine and could show me how. I just needed a simple demo!

I heard nothing.

Let me just say that trying to put together a useful lesson about how to do actual library work with linked data was not a super-humane experience. I felt annoyed. I felt stupid. I felt frustrated. I felt like hey, if the Semantic Web is so soylent, how about I just eat up all you linked data nerds? And I am a vegetarian!

Authority control is basic, basic stuff, folks. Many librarians consider it a touchstone of library practice, something central to our professional identities (so to speak). If I can’t do authority control work with linked data, do not even talk to me about how linked data is more flexible, linked data is wonderful, linked data is superior—linked data is useless. It is useless for librarians in practical terms. That’s not a problem with librarians. That’s a problem with linked data.

The end of the story, just to add insult to injury, is that I discovered that Open Refine’s built-in Freebase reconciliation didn’t work. After some searching and fiddling, though, I was able to fix it, and that leads me to another thing I want to talk about: the state of tools available for just messing around with linked data.

I am showing you the instructions for installing the RDF extension for Open Refine—which, by the way, I think this is great and I want more things like it. These are the long instructions, mind you—there’s a shorter set on the main page. There’s a major error in these; you can’t actually get to the workspace directory from the Open Refine start page, because the start page starts on the Create tab, not the Open tab. I flatter myself I’m pretty tech-savvy, but I had to click around and swear a bit before I figured out what these instructions were getting at.

I ended up writing my own installation instructions that seemed to work pretty well. You’re welcome. Please don’t make me do this again. Wrong installation instructions are just not soylent, and this installation method is ridiculous on its face, not soylent at all.

If there are better tools—tools that help me help my learners get actual library work done with linked data—I do not know what they are. I’m not sure they even exist. And that’s a gigantic problem for me as an educator, and ultimately it’s a gigantic problem for you and for linked data. If I fail at my job, you know what happens. It’s what happened with XML and publishing, where XML did not help get publishing work done. It’s what happened with institutional repositories, which basically didn’t help anybody get any work done.

Soylent technologies, technologies that are so respectful of people that people jump for joy about using them, help those people get stuff done. It’s as simple as that. And this needs to be true for people who are not linked data nerds and not programmers.

Look, fundamentally, this is the same reason programmers hate MARC! MARC gets in the way of programmers getting useful work done, right? But if linked data puts every other librarian on earth in the position that library programmers are currently in, that’s not going to help linked-data adoption in libraries.

So to sum up here… because I can’t educate people well, and because the tools are so bad, and because practically nobody can actually get library work done with linked data, linked data is stuck in what I’ve seen called “negative path dependence.” What’s negative path dependence? I quote from a recent report on data sharing: “Because of high switching costs, inferior technologies can become so dominant that even superior technologies cannot surpass them in the marketplace.” Sounds like XML in publishing, right, compared to PDF? Sounds like institutional repositories against journals, right?

I’m afraid it sounds like linked data against MARC, too. Meaning no disrespect at all to the great Henriette Avram, MARC is the inferior technology here—I really believe that! But linked data, despite its superiority, can’t get library work done at this point without ridiculous costs, so it can’t replace MARC.

But it doesn’t have to be this way. This I also believe.

I’ll close with four challenges for the Soylent Semantic Web, the Semantic Web that is made of librarians and other people. I hope—and I believe!—that presenters at this conference will answer these challenges, and I look forward to seeing that… and I also hope that all of you take these challenges home and work on them.

Challenge #1: Work, not ontologies, for linked data

Here is my linked-data heresy. Feel free to turn me into hamburger for it later: I don’t care about your ontology. I don’t care about anybody’s ontology, or data model, or graph, or whatever. I do not care. Why should I? We’ve done library work without ontologies and picture-perfect data models for hundreds of years, somehow or other. Can we just get off ontologies already?

What do I care about? I care about the work I can do with linked data, and the work librarians can do with linked data, and the work my learners can do with linked data. I care about the tools that help them do that work. I care about the work skills I can realistically teach my learners that someone will pay them for—and before you say anything, “knowing an ontology” is not something employers will pay for!

So I don’t need ontologies. I need well-documented linked-data tools that I can use and teach. I need linked-data workflows, based on real-world problems and real-world solutions, that I can demonstrate and imitate. I need linked-data systems that do real library work, right out of the box. Very little of this exists today because too much of the linked-data community is off in corners having arcane discussions about owl:sameAs and httpRange-14—just like XML namespaces back in the day. And I’m saying, stop that. Before you write one more line of OWL or RDF Schema, write code that lets real live people do real-world work with linked data.
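To make the “work, not ontologies” point concrete, here is a deliberately tiny sketch (every URI and name in it is invented, and this is nobody’s real system) of the kind of task I mean: not modeling for its own sake, but answering a practical question from a pile of subject-predicate-object statements.

```python
# A toy triple store: linked data boiled down to its essence, a list of
# subject-predicate-object statements. All identifiers here are made up.
triples = [
    ("ex:book1", "dc:title", "Little Brother"),
    ("ex:book1", "dc:creator", "ex:doctorow"),
    ("ex:book2", "dc:title", "Homeland"),
    ("ex:book2", "dc:creator", "ex:doctorow"),
    ("ex:doctorow", "foaf:name", "Cory Doctorow"),
]

def objects(subject, predicate):
    """Return every object asserted for a subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def titles_by(author_name):
    """A real-world task: list titles of books by a named author."""
    authors = [s for s, p, o in triples
               if p == "foaf:name" and o == author_name]
    return [title
            for s, p, o in triples
            if p == "dc:creator" and o in authors
            for title in objects(s, "dc:title")]

print(titles_by("Cory Doctorow"))  # ['Little Brother', 'Homeland']
```

The point of the sketch: a patron question got answered, and no ontology was consulted at any step.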

Challenge #2: It’s not about what you can do with linked data. It’s about what I can do with linked data.

When I was running institutional repositories, I went to conferences about them, as you do. And at those conferences I saw a lot of demos of new and innovative software hacks. And a lot of those demos were absolutely amazing—but they were completely irrelevant to me, because they were impossible to implement in my environment. So I challenge everyone here, because you are all experts already, to stop thinking about what you can do with linked data and instead think about what I can do with linked data. And what my learners can do. And what catalogers and metadata librarians and digital-library managers and institutional-repository managers and reference librarians can do! Because if you are the only one who can do what you do with linked data, librarianship writ large will never be able to do it. And if you think this is a stealth demand for better tool usability, you’re absolutely right, it is! But that’s not all it is.

This means that you need to learn about what I do, and what I can do. And what catalogers and metadata librarians and all the rest of us do. Maybe actually watching us do it? Maybe doing some of it yourselves? Yes. So I challenge you to be curious about my work environment as an educator. And catalogers’ work environments. And digital-library work environments. Find out about those, firsthand, and use what you learn to build linked-data systems that all librarians and libraries benefit from.

Challenge #3: Wow me with linked data. Wow librarianship with linked data!

My third challenge, and I’m quite hopeful about this one, actually—make me say “wow!” about something you did with linked data. And why stop at me? I challenge you to wow all of librarianship with linked data!

Some of you may remember the rollout of the Endeca-based library catalog at North Carolina State University in the mid-2000s. For those of you who don’t recall, it was this one catalog that started the whole discovery-layer movement. What I remember most about that was that the new catalog got basically zero pushback from librarianship generally, even though it was a huge change where you’d normally expect a lot of negative path dependence to kick in. Instead, everybody said “wow.” Wow, I want that! Wow, look, facets for narrowing searches! Wow, check it out, you can actually start a query by drilling down through subject headings! Wow, de-duplicated records! Wow, relevance ranking! It was just a giant leap forward from what we had. Forget negative path dependence, people wanted this functionality now.

I challenge you to make something for libraries with linked data that has as much wow as that original Endeca catalog did, so much wow that nobody even argues about linked data because everybody wants what it can do.

Challenge #4: Disrupt MARC with linked data

Okay, I’m just going to say this: If we want MARC dead—and we do—we’ll have to kill it ourselves and eat the evidence. But I have a different idea about how to do this than I think most librarians in the linked-data space do. I see linked-data effort focusing on big national libraries, big academic libraries, big consortia, nothing but big-big-big. I’m not sure that’s the right strategy all by itself, to be honest. I’m sorry for using the word “disrupt” because I know it’s a giant cliché now, but I’m serious about it. Let me explain what I mean.

Last summer I taught another continuing-education course for public librarians, about acquiring books from independent publishers and people who self-publish. And one of my learners, who is a public librarian in a small-town public library, said a very sad thing. There was no way her library would be able to buy indie or self-published books, not print and not electronic. Just no way. Why not, I asked. Because there are only two employees at that library, she said, so they can’t do any original cataloging.

That librarian and her little tiny two-person library? They’re what disruption theory calls an “underserved market.” MARC is no good for her—it’s too complicated and too expensive. If you can make a simple linked-data system that’s cheaper and easier and more convenient for her, and lets her put in all the books she wants, including indie books, and lets her patrons find all the books they want, she will use it. So will a lot of little tiny libraries that just can’t do MARC. And if linked data is so easy and so great that little tiny libraries with two employees use it, what’s everybody else’s excuse, right? If linked data starts small, it can take over the world from MARC! I really believe this!

So if you say linked data is so much better than MARC, I’m saying prove it, for great justice! Okay, okay, last nerd joke, I promise, but the serious point behind the joke is that there really is a social justice issue here. Linked data shouldn’t be something that only helps big libraries and their librarians. Let’s build small first, and build up from there, and then we can help all libraries, all librarians, and all library patrons. I think a linked-data catalog that small libraries and their librarians can actually use and is demonstrably better than what they have can be built. Right now, today, it can be built. I challenge you to build it, for great justice—including justice within librarianship for linked data.

So once again, thanks for having me, and I look forward to the rest of the conference!


Soylent Semantic Web: Introduction

I was not Semantic Web in Libraries’ first choice to keynote their 2013 conference. I wasn’t their second choice either. Or their third. I’m actually not sure how far down the list I was, to be honest, but “a ways” seems fair. When I suggested the title and blurb, the conference organizers politely concealed mild dismay.

I don’t mind this sort of thing; I consider it a point of pride to exceed expectations, even when I am speaking at what my body persists in telling me is two in the morning. I am also tremendously grateful to have attended the conference; I learned voraciously, and much of what I learned went straight into my classroom work.

This is also the most unabashedly geek-culture-filled talk I have ever given. Perhaps someday I will surpass it, but not, I think, any time soon.


Is This Big Data Which I See Before Me?

Hello, I’m Dorothea Salo from the School of Library and Information Studies at the University of Wisconsin at Madison, and the first thing I’m going to do is apologize for the talk title in the day’s agenda, which is a horrific misquotation of Shakespeare’s Macbeth. Totally my fault, not Eric’s or OCLC’s, sorry about that—it’s correct on the slide!

So, big data.

A well-known characterization of “big data” describes it as having volume, velocity, and variety. Sheer volume is probably the first thing to spring to mind when somebody says “big data”—I don’t think I need to explain it—but size is not everything! Velocity matters too: how fast do these data pile up? How fast do they need to be cleaned up and used? How fast does interaction with the data need to be? How easy is it to get data where they’re going, in the form they need to be in?

That gets to the third vee, variety. From a computational perspective—and computers are notoriously persnickety and not-smart about this—how clean are the data in the first place? How much effort does it take to clean them, and how much of that effort can be automated? Note that high variety is not a good thing! Ideal data for analysis is clean and above all consistent: easy to understand, and simple to mess around with computationally. In the real world, though, big data tends to mean more variety than is wanted. So a bit of hope here for libraries as we struggle with variety in our data: it’s not just us, and we’re not alone!

So keep these vees in mind as I go on talking. None of them is more important than any other; they all factor into making the best use of big data.

Where they most breed and haunt…

So where’s big data? It’s everywhere. It’s in science—oops, the Large Hadron Collider twitched, that’s another petabyte. It’s on the web, of course, from Google to Facebook to Amazon.

Why, I can buy me twenty at any market…

What need we fear who knows it, when none can call our power to account?

And even beyond the online giants, big data has hit business, where the hype runs highest, and where “big data” seems to mean something like “anything we can collect about our customers or users and their behavior to correlate with other companies’ data in flagrant violation of any notion of privacy.” And I think it’s important to watch how that debate evolves, as academe and its libraries keep getting told “behave like a business!” and businesses keep behaving so horrendously.

“Why, I can buy me twenty at any market,” incidentally, is said by Lady Macbeth, and it’s about husband acquisition. That Lady Macbeth, business genius for our time!

But in these cases
We still have judgment here; that we but teach
Bloody instructions, which, being taught, return
To plague the inventor:

Big data is in education too—who knew?—in the form of “learning analytics.” We in academic libraries should be watching this, as should folk who have served on IRBs, because it’s troubling from a student-privacy perspective, and I don’t know who has more authority in academe to speak truth to power about privacy than academic librarians.

So, of course libraries have data, and we use data in decisionmaking, in asserting our value, in collection-development and service decisions, and so on. All I need to do is say “LibQual,” right? The question I was asked to address today, though, is whether libraries have, or will have, “Big Data.”

Your face, my thane, is as a book where men
May read strange matters.

Kind gentlemen, your pains
Are register’d where every day I turn
The leaf to read them.

And I have several answers to that question.

  • YES, libraries have big data. Of course we do.
  • YES, libraries have or could have big data, BUT its collection or use is somehow problematic.
  • NO, sometimes what libraries have isn’t big data. It might be big, it might be important, but it’s not actually data, and that is often problematic.
  • Some library data could be big data, but NOT YET it’s not. And finally…
  • Big Data, SIGH. We could have big data and it’d be super-cool if we did, but something completely unnecessary is in the way.

Big data? Yes.

Some of you are looking at me right now all like “what a ridiculous question! Of course libraries have big data! Where have you been for the last twenty years?!” And you know, if that’s you, you are quite right. National libraries and some major research libraries have been in the big-volume data game for some time because of digitization, and more recently, conscious collection of large volumes of born-digital materials. Here’s what Leslie Johnston claimed last year the Library of Congress is hanging onto digitally: five million newspaper pages, some hundreds of thousands of e-journal articles, five billion web-archive files, scads of digital audio and video, and what by now is probably close to if not more than a hundred billion tweets.

Interestingly, I’ve seen news stories that hint that the Library of Congress’s Twitter database is running into a serious velocity problem! They have all the tweets, just not the computational power to let researchers or anybody else do anything with them. It’s too big a dataset to be downloadable, so the combination of high volume and a hoped-for high velocity is pretty deadly.

We’re probably all familiar with Hathi Trust’s home page by now; in fact, a lot of the institutions represented in this room are Hathi Trust members. It’s worth remembering that Hathi Trust came about in order to solve a classic big data volume problem: where the heck to put all those page scans and OCRed texts from the Google Books project! As Hathi grows and changes, we see its people tackling more problems that would sound really familiar to a big-data analyst in business or a so-called data scientist: what can we find out from this gigantic pile of bits? How do we best clean up the OCR so that linguistic and literary analysis is reliable, and how do we deal with language variation over time?

I have to tell you, as a historical-linguist-in-a-past-life and a sometime computer programmer, a lot of the analyses I see Ph.D.s proudly trotting out these days are pretty weak. I don’t just mean “the digital humanities,” either, though there’s plenty of eye-rolly work there—that “culturomics” stuff coming out of Google’s comp-sci people draws some obviously overbroad conclusions from a failure to consider the limitations of its evidence base. There’s a lesson in that, though: with big data, we’re all learning by doing. We’ll get better at it; just give us time, and room to monkey around.

Threescore and ten I can remember well:
Within the volume of which time I have seen
Hours dreadful and things strange…

Yes, libraries collectively have big data and have had it for a long time! This is not new at all. What’s changing is that individual libraries are starting to run into high-volume and high-variety data problems. In academic libraries, for example, faculty are starting to look to us to help with research-data management. Some digital libraries are seriously getting into targeted web archiving, too.

And here’s where I go all finger-shaky at us: right now, in May 2013, most of us are not investing nearly enough in computing infrastructure and development to be able to keep up well. We heard this morning from Sarah Pritchard that data management and curation is a thing in research institutions; I’m here to tell you that the opportunity for libraries to stake a claim to research-data management and archiving in particular is a time-limited one. If academic libraries don’t prove we can help—and that means a lot more than putting together a committee or hiring one person—researchers should and will go elsewhere.

So we can have Big Data… but only if we decide we want it badly enough.

Yes… but

If you weren’t watching, you missed this one: Harvard Library for a very brief time piloted a service called Library Hose that tweeted the titles of books that had been checked out of the library, shortly after that checkout. Eyes were rolled, fusses were fussed, and the Library Hose was shut down, because honestly, it’s kind of a bad idea. But that’s only a funny example of extremely serious questions about ethical uses of the data that libraries could and sometimes do collect about patrons, individually and in aggregate, on-purpose and inadvertently: search data, patron-computer-use data, patron-behavior data.

We discussed the differentiation of libraries from other services earlier in the Q&A, but in my mind at least, it’s an easy question. We want to differentiate ourselves from Google, our search competitor? We want to differentiate ourselves from Facebook, our social-activity competitor? We want to differentiate ourselves from Amazon, our content-purveying competitor? Easy. We do not sell out our patrons through their data.


For any reason.

Even if they invite us to. No matter how tempting it is, how many nifty things we could build, or how hard our patrons push us to do things that we know could turn around and bite them, in this age of increased surveillance from government and business and black-hat hackers everywhere. These are surely “political problems” rather than technical ones, but you can’t just wish political problems away. I’m avoiding the obvious cheap shot here out of respect for the dead, but I’m sure all of you can fill it in for me. In lieu of that, I’ll just say that America Online and Netflix both learned really quickly that “sanitizing” data doesn’t, and “deidentified” data isn’t.
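Here is a toy illustration of why “deidentified” data isn’t (every record in it is invented): leave a couple of quasi-identifiers in a “sanitized” circulation log, and one join against a public roster puts the names right back.

```python
# Invented "deidentified" circulation log: names stripped, but zip code
# and birth year kept -- the classic quasi-identifier mistake.
deidentified_log = [
    {"zip": "53706", "birth_year": 1984, "title": "On Liberty"},
    {"zip": "53703", "birth_year": 1990, "title": "The Prince"},
]

# Invented public roster of the sort an attacker can easily obtain.
public_roster = [
    {"name": "A. Patron", "zip": "53706", "birth_year": 1984},
    {"name": "B. Patron", "zip": "53703", "birth_year": 1990},
]

# One join on the quasi-identifiers re-identifies every reading record.
reidentified = [
    (person["name"], record["title"])
    for record in deidentified_log
    for person in public_roster
    if (person["zip"], person["birth_year"])
       == (record["zip"], record["birth_year"])
]
print(reidentified)
```

The AOL and Netflix incidents were exactly this, at scale: the “sanitized” release retained enough signal to link back to individuals.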

We don’t sell out our patrons. We just don’t. That’s our first requirement whenever we talk about using or even keeping certain kinds of patron data, or patron-traceable data. The only way to keep data safe is often to destroy it or refuse to keep it in the first place. Fact of the computing life.

Big data? No. This isn’t even data.

This is a subtle point, but one that governments are particularly struggling with as open data becomes a thing for them: it’s possible to turn data into something that looks like data but isn’t, which often defeats the purpose of collecting or sharing the data in the first place. Does this happen in libraries? You bet. And often, it happens with exactly the kind of data we’ve been discussing today.

…thereby shall we shadow
The numbers of our host and make discovery
Err in report of us.

This Traverse Area District Library web page, a sort of library-activity snapshot infographic, is brilliant and I love it. When it made the rounds of my online librarian friends, there was a chorus of “I want my library to do this!” But it’s not data. There’s data underneath it somewhere, but as presented, this is not data. It could be—and if we could collect this information from libraries all over the place, it could even be big data—but it’s not. The problem is that third vee, variety. If I wanted to compute on these numbers, I’d have to grab the HTML and laboriously write code to extract the numbers from it, and as soon as Traverse Area District Library changes their content-management system or does a redesign, my code breaks. Multiply this by all the libraries in all the cities and towns in all the states everywhere, and you see the problem.
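To show what I mean by “laboriously write code to extract the numbers,” here is a minimal sketch, with made-up markup standing in for the real page: the scraper works until the first redesign, and then silently returns nothing.

```python
import re

# Hypothetical markup standing in for a library's infographic page.
# The number is only reachable by pattern-matching the presentation.
html_v1 = '<div class="stat"><b>52,417</b> items checked out</div>'

def items_checked_out(html):
    # Brittle by design: this regex hard-codes the current page layout.
    m = re.search(r'<b>([\d,]+)</b> items checked out', html)
    return int(m.group(1).replace(",", "")) if m else None

print(items_checked_out(html_v1))  # 52417

# After a redesign, the same fact is still on the page, but the
# scraper can no longer find it.
html_v2 = '<span data-label="checkouts">52,417</span>'
print(items_checked_out(html_v2))  # None
```

Multiply that maintenance burden by every library website in every town and you have the variety problem in one picture.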

So, acknowledging that qualitative data is often-though-not-always an exception to this rule, take this rule away with you: if it’s not computable, it’s not data, big or otherwise. Libraries have treated the computability of the data we create and collect as a low-priority consideration for far too long.

… up, up, and see
The great doom’s image!

Making an infographic or a pie chart or an HTML data table takes pieces of the data—usually not even everything—and reduces them to something that tells a story, because graphs and charts and tables almost always tell stories much better than the actual data do. So a graph or a table or a chart or an infographic is data trapped in amber. It’s very beautiful, and human beings appreciate that beauty, but you can’t get those little particles of data back out, much less do anything useful with them if you did! They’re just not computable any more. You’ve doomed your data!

Any data you’re putting out there in PDFs, incidentally? It’s not data any more! Stop that! We in libraries should be setting the example here! And we should lean on our vendors about this, too. There’s just no point in them providing data that we can’t use for our purposes.

The sacred storehouse of his predecessors,
And guardian of their bones.

Which brings me to the skeleton in the closet (speaking of bones): MARC. If I had a nickel for every cataloger who’s asked me what the problem is with MARC and AACR2 and ISBD, I would never need to work a day in my life again. Here’s the problem in a nutshell, and it’s not news, because Kim [Armstrong] alluded to it earlier with respect to harmonizing serials holdings in the CIC. The records we put into our library catalogs are marginally computable at best. If you don’t believe me, ask any programmer anywhere who’s worked with MARC records. And you heard Kim talk about Google Books and library metadata—look, Google has the smartest engineers anywhere; if they can’t compute on our data, it’s not computable. That uncomputability is costing us untold amounts of money in systems and cleanup programmers, not to mention mindshare on the larger information web that libraries are only a part of. We have got to do better.

Another aspect of the MARC problem gets back to the third vee, variety. Local practice, rule interpretations and other changes over time that don’t get retroactively fixed in old records, places where AACR2 just throws up its hands and says “as long as it’s human-readable, do what you want”—all this increases the variety in our catalog records, which decreases their computability and reuse value. Whatever happens with RDA and BIBFRAME and similar efforts, if we end up with yet another sloppy tower of Babel, it’s not solving the problems we have.

Cataloging for your users—computers, their programmers, and their users are your users.

Strange things I have in head, that will to hand;
Which must be acted ere they may be scann’d.

Digital librarians, among whom I include myself—come on, we know we’re not off the hook here. I ran institutional repositories for six years, and I got an entire published article out of one authority-control mishap where one author had eight different name variants in the IR. Our data isn’t clean and consistent. It isn’t computable, and it can’t be aggregated usefully or consistently. Let’s not pretend!

What we can do, though, is watch the big-data pioneers and the techniques they use to cut through the chaos. Natural-language processing. Fuzzy matching. If you haven’t played with OpenRefine, which used to be Google Refine, you completely need to grab some random data from your catalog or digital library or wherever and do that—it’s actually really fun! If only so that you see what the possibilities are.
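For a taste of what fuzzy matching can do with name variants like my eight-variant repository author, here is a minimal standard-library sketch; the names, the normalization, and the similarity threshold are all my own inventions, and it is nothing like as clever as OpenRefine’s real clustering methods.

```python
import difflib

# Invented name variants of the sort that pile up in a repository.
variants = ["Salo, Dorothea", "Salo, D.", "Dorothea Salo",
            "SALO DOROTHEA", "Smith, John"]

def normalize(name):
    # Crude normalization: lowercase, strip punctuation, sort tokens.
    tokens = "".join(c if c.isalnum() or c.isspace() else " "
                     for c in name.lower()).split()
    return " ".join(sorted(tokens))

def cluster(names, threshold=0.8):
    """Greedy clustering: put each name in the first group whose first
    member's normalized form is similar enough, else start a new group."""
    clusters = []
    for name in names:
        for group in clusters:
            ratio = difflib.SequenceMatcher(
                None, normalize(name), normalize(group[0])).ratio()
            if ratio >= threshold:
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters

print(cluster(variants))
```

Even this crude pass groups the full-name variants together; initial-only forms like “Salo, D.” slip through, which is exactly why the real tools layer on smarter key collision and distance methods.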

Big Data? Not yet.

Libraries also have data that doesn’t look all that big—or all that powerful—when you only have it from a single library, but if you add together that same data from a whole bunch of libraries, suddenly you have something super-interesting.

Ay, in the catalogue ye go for men;
As hounds and greyhounds, mongrels, spaniels, curs,
Shoughs, water-rugs and demi-wolves, are clept
All by the name of dogs: the valued file
Distinguishes the swift, the slow, the subtle,
The housekeeper, the hunter, every one
According to the gift which bounteous nature
Hath in him closed; whereby he does receive
Particular addition.

The term of art for this, of course, is “aggregation,” and it happens all over the place already; it’s nothing new. Any data, any data at all, can be aggregated… in theory. In practice, a successful aggregation depends a lot on keeping a lid on that third big-data vee, variety. It may also depend on velocity: keeping things current, fixing errors quickly, and similar speed-dependent concerns.
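A sketch of why variety is the tax on aggregation: three invented feeds below describe the same book under different field names, and nothing can be counted across them until somebody writes the normalization layer. The field map and feeds are hypothetical.

```python
from collections import Counter

# Invented feeds from three hypothetical libraries, each spelling the
# same facts differently -- the "variety" problem in miniature.
feeds = [
    {"Title": "Moby-Dick", "Year": "1851"},
    {"title": "Moby Dick", "pub_date": "1851"},
    {"TITLE": "MOBY-DICK; OR, THE WHALE", "year": 1851},
]

# The aggregator's burden: a hand-maintained map from every local
# spelling to one canonical field name.
FIELD_MAP = {"title": "title", "year": "year", "pub_date": "year"}

def normalize_record(rec):
    out = {}
    for key, value in rec.items():
        canon = FIELD_MAP.get(key.lower())
        if canon:
            out[canon] = str(value)
    return out

normalized = [normalize_record(r) for r in feeds]

# Only after normalization can we aggregate, e.g. count records per year.
by_year = Counter(r["year"] for r in normalized)
print(by_year)  # Counter({'1851': 3})
```

Every new contributor with a new local practice grows the map, which is why successful aggregations work so hard to keep a lid on variety at the point of contribution.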

We shall not spend a large expense of time
Before we reckon with your several loves…

All the catalogers in the room know this already, of course, because of WorldCat. I’m not a cataloger and definitely no expert, but I do know that OCLC does its level best to enforce certain kinds of consistency in contributed MARC records, above and beyond what MARC and AACR2 and RDA insist on, because if they don’t, the search engine doesn’t work! And, you know, we all know they don’t do a perfect job of it… but to some extent that’s on us, because of the MARC closet skeletons I mentioned earlier.

I see thee compass’d with thy kingdom’s pearl…

Any Michigan folks here? Here’s a blast from the past for you: OAIster, which now belongs to our good hosts at OCLC. See, we’ve tried large-scale aggregation with highly heterogeneous metadata—far more variable than the MARC coming from skilled catalogers—before. With OAIster, it didn’t work out so well. Variety in our data bit us yet again, as did some really pretty unwise and evitable structural flaws in the harvesting protocol, OAI-PMH, such as a total lack of error reporting and no flag for metadata-only records so that searches could exclude them.

What have we learned from the wonderful, bizarre, epic mess that is OAIster? Let’s see.

O proper stuff! This is the very painting of your fear…

We have another chance to try aggregation, in the guise of the Digital Public Library of America. It’s very early days yet, but I did want to call out one thing that I think DPLA is doing right: cutting the Gordian knot of intellectual-property rights in metadata. Long story short, some metadata is too factual to qualify for copyright protection in the US; other metadata such as abstracts clearly does qualify. But DPLA isn’t playing that game. They say very clearly: if you want to play with us, you do not play intellectual-property games with your metadata. You start up with that, we kick you out. They’re gambling, of course, that they become enough of a name to conjure with that they can make this stick. As I said, it’s early days, but I’m not betting against them—and I appreciate this approach very, very much.

Here’s what I want to know, though. Can DPLA get past the metadata-quality issues that made a mess out of the National Science Digital Library, never mind OAIster? They seem to be leaving training and quality control to their Service Hubs. Maybe that will work, but I don’t see any kind of feedback loop being built in here, and it worries me some.

Big data? Sigh. It could be, but…

And that question leads me to what, if I were the Porter in Macbeth, I’d call something simultaneously grandiloquent and obscene; but since I’m not the Porter in Macbeth, I’ll just call it “the graveyard of missed big-data opportunities.”

The multiplying villanies of nature
Do swarm upon him

There’s so much we should know about books that we don’t, in spite of all the effort we spend cataloging them. A public-librarian friend of mine, Laura Crossett, was once simply trying to make sure her library had all the books in any series where any book in the series was circulating well. What happened? The series information stumped her. This is just a silly problem to have. Honestly, I think we have it because our ideas and practices around cataloging are so fragmented and so calcified.

And digital librarians, we only get to gloat about this because we’re often describing unique and standalone materials. Otherwise, we’re just as bad.

We need to build big data together—it’s not just the responsibility of the Library of Congress or the New York Public Library or Harvard or OCLC, it’s everybody’s responsibility. And one of the ways we do it is by eliminating redundant labor, well beyond copy cataloging even, so that we can actually do things like record series information and relationship information to further embiggen and enrich our data. I think the linked-data infrastructures that several national libraries are building can do that… if we let them!

…there cannot be
That vulture in you, to devour so many

Am I the only one who worries about the sixth V of Big Data? “Velociraptor.” When a big commercial aggregator swoops in & devours everything.

—Les Carr (in a tweet)

I can’t really add anything to this combination of Les Carr and Shakespeare. I’m just going to admire this for a second… and no, I don’t know what the fourth and fifth vees are either.

But seriously, Les is right. We saw it with serials and their metadata, we’re seeing it now with e-textbooks, we’re even starting to see it with a few kinds of research data, and I don’t know who will stop it and build a real big-data commons if it’s not academic libraries. So if you need a reason to get involved with open access and open data, this is it: it beats the heck out of the velociraptor alternatives.

And because Deb Blecic mentioned it earlier: Non-disclosure agreements are a velociraptor indicator. I don’t like them, and I don’t think any of us should. Just sayin’.

As two spent swimmers, that do cling together
And choke their art.

And here’s where, as I generally do, I bite the hand that’s feeding me. OCLC, you are clearly of two minds on this big-data thing, and I think you’re hurting yourself by it. On the one hand, there’s OCLC Research, which is making amazing big-data things like the Virtual International Authority File and working hard—and pretty successfully—to embed them in the larger information world. On the other hand, there’s the dog-in-the-manger intellectual-property shenanigans OCLC proper keeps trying to pull with the records contributed to WorldCat, which made the National Library of Sweden pull out of WorldCat altogether, and which is infuriating those of us who are paying attention to where the big-data world is going.

Please, OCLC, get your act together. If you’re going to insist on being a velociraptor, please spin off OCLC Research so that you don’t drown it when we drown you—and we will. It will take time, just as the open-access movement is taking time, but we can destroy you and we will. Or take OCLC Research as your model and stop being a velociraptor.

… when worldcat.org do come to DPLAne

It’s not just libraries who want to treat WorldCat as a big juicy Big Data-store, either. On FriendFeed once, a librarian quoted what OCLC actually lets affiliates do with WorldCat data in the context of scholarly research, adding: “I can, barely, stretch this definition to include the work that I’m doing on my research project… but the grad student who wants to use WorldCat data for a bibliographic study of the spread of publishing in New Spain is pretty much out of luck.”

Stop it, OCLC. Just stop it. You are shutting yourself out of big-data land, and when you do that, you shut us libraries out too. Hathi Trust is willing to fight in federal court to allow researchers to do research on its corpus, and OCLC comes at researchers with legalese? Stop it. DPLA insists that all contributed metadata be available for any meditated reuse, within reasonable limits of bandwidth, and OCLC gives researchers static? Stop it. Bring worldcat.org to DPLAne instead.

Think upon what hath chanced, and, at more time,
The interim having weigh’d it, let us speak
Our free hearts each to other.

All right, I’m done lecturing OCLC, and now I’m going to lecture everyone else in this room, because I clearly haven’t made enough enemies, right? To some extent, OCLC is doing what it’s doing because it knows that a lot of academic libraries love to free-ride. We heard about this from Kim today briefly with regard to collective print management—“can the CIC let California do it?”—and, you know, I come out of open access and open source, so I’ve seen it firsthand. Contribute programmer time to an open-source project? Nope. Pay for a membership in an open-source foundation, or participate in a collective digital-preservation system? Forget about it; who has money for that? Put actual acquisitions money toward open access? Bah, we have an institutional repository… under somebody’s desk… in the third sub-basement… somewhere; that’s enough, right?

But at some point, free-riding prevents useful collective action, and I think big data is one of those points. Big data isn’t free. Open data, big or small, isn’t free. It’s really tempting to pretend it is and free-ride anyway, I get that. But free-riding is slimy and lazy and unethical and we need to stop doing it. No one library can talk OCLC down off the ledge; heaven knows Sweden tried. But maybe we can talk OCLC down together, as a community. Shouldn’t we try?

Big Data? Now what?

I was asked to talk about what skills and scaffolding we need to make and use big data in our libraries. And I’m sorry, Eric, but I tried and tried to make a slide answering that question and I just couldn’t.

Every one that does so is a traitor, and must be hanged.

The reason I couldn’t is that I know what way too many academic libraries do with lists of skills—they think they can just hire some poor Macduff with a random grab-bag list of skills and call him a “Big Data Coordinator” or some such thing, and then they’ve solved the big data problem and they can go home and have a drink.

I don’t work in libraries any more in part because my own career was badly hurt by that kind of “skills thinking” with respect to scholarly communication and open access. I don’t think thinking about library services in terms of laundry lists of skills works! And I know it hurts people, because I’ve had former students come back to me for advice over it, and I’ve seen it hurt much better librarians than I ever had a hope of being.

Now that my job is preparing people for librarianship, I explicitly warn my students about skills thinking and how it manifests in job descriptions, and I tell them not to apply for those laundry-list, unsupported-single-person-in-a-disregarded-corner jobs. The dice are just too loaded against them. So if you think you’re going to hire a Research Data Coordinator, or a Digital Humanities Librarian, or one bioinformaticist, or one statistician, somebody with serious skills, and you’re going to wind that person up and turn them loose and miracles will happen? Well, I’m with Lady Macduff on this one—hang the traitors!

Instead… well, I hate the word “infrastructure,” because it’s impersonal and overused, so I’m going to suggest we build “scaffolding,” by way of a more holistic, less skills-focused mode of thinking about the opportunities big data might hold for us, and what we’ll have to be and do to capitalize on those opportunities. I like “scaffolding” because—look, you can hire a Michelangelo, but if you don’t put that scaffolding under him, he ain’t painting you no Sistine Chapel. So here are some questions I think are worth asking.

  • Who cares, right now? Who else should? (Note that this means you look at your existing staff as well as your environment, because ignoring your current people and their skills and interests is shortsighted and counterproductive.)
  • What bets should we make, now and in the future? (There is no such thing as a sure thing. We have to bet. Betting means risk, and risk means failure. As they say, fail fast and often.)
  • What can we use, right now? How can we experiment?
  • What do we build? Fix?

And that’s where I’m at on this just now! Hope it helped, and I’m findable on Twitter and the web if you have questions.


Is This Big Data Which I See Before Me? Introduction

One of the few things I can say in favor of my wreck of a librarian career is that it left me pragmatically fearless. I do sometimes bite hands that feed me, if I think the bite bids fair to improve something important. It’s fair to say, though, that I only bite where I trust the bitten party to hear me honestly and behave toward me in the same good faith with which I bite. Where I can’t trust that, I don’t speak, much less bite.

(“Why haven’t you spoken at the Charleston Conference?” I occasionally hear. Well… aside from never having been asked to, that’s why. I do not trust that conference or many of its attendees to behave gracefully when they hear sharp critique.)

This particular presentation bit OCLC pretty hard, despite OCLC sponsoring the symposium at which it took place, at Northwestern University in May 2013. I thought then, and still think, that some elements of the organization needed their noses rubbed in some pungent problems (just to continue the canine metaphor). To OCLC’s credit, its representatives at the event took my toothy grin in good part.