Who’s Devouring Whom? Preventing Publishing Predation

So hi, I’m Dorothea Salo. I’m a librarian, and I teach in the Information School at the University of Wisconsin at Madison. Thank you so much for inviting me to share Open Access Week with you here at Loyola! It’s just about my favorite week of the year. I’m happy to come to Loyola specifically because of how long you’ve been paying attention to open access in general and Open Access Week in particular. I looked at your history of Open Access Week programs; you’ve had some amazing people here and I’m honored to become one of that number this year. Seriously, Loyola should pat itself on the back—no, I mean it, go ahead, pat yourself on the back—lots of places aren’t nearly this interested or this aware.

The first thing I want to do is validate some feelings that I suspect are being felt in this room. Lucky us, we’re sitting in a time of rapid and quite titanic change in how scholarly conversations happen and how research is published and how those publications are circulated and recorded for posterity. It is seriously a jungle out there! And since so much about an academic career is rooted in publication, all this change generates a lot of career anxiety. It’s like the world is changing out from under us, all the rules we’re familiar with are in flux, and what are we supposed to do, right? If you’re anxious, I want to tell you you’re not alone. Of course you’re not, how could you be? So I want to say, this stuff is not changing at you; it’s just changing. And I hope by the time I’m done today you start to understand some reasons that it’s changing, and even why you might want to help that change along.

I organized this talk around a question: who’s eating whom? If you think I chose this theme because so-called “predatory publishing” is back in the news, you’re absolutely right, congratulations! I totally intend to talk about that. But what I really want to do today is expand your sense of what predatory behavior in scholarly communication is, okay? Because a lot of past and present predation—and there really isn’t a better word for it, I like the word “predation” a lot here—tends to slide under a lot of people’s radar. You don’t see it happening, you don’t see its victims, you don’t think about its effects on your colleagues, on your students, on your discipline, on research generally, on the world generally. Lots of things just become “the way things are,” no matter how predatory they actually are if you look at them closely. I know we have some librarians in this room, so I just want to say, don’t get smug, librarians. We like to think this is our turf, we’re awesome, we get this right. But some of us are as clueless as the most newbie graduate student ever, and a lot of us are a lot lazier about open access than we ought to be, and we need, as a profession, to do better than that.

Now, the danger of framing things in this predator-prey way is, everybody in scholarly communication is feeling like prey right now! Right? We all feel like minnows being chased by sharks here! There isn’t anybody in this system who doesn’t feel like they’re lunch. So for the duration of this talk, I’m asking everybody in this room to entertain the notion that maybe, just maybe you too are a predator. Maybe some of the things you yourself do hurt others in the system. Maybe you enable predation—maybe you yourself don’t eat the cute minnow, but you lead the shark to it, or you chase it a lot so it gets tired out and a shark eats it. I don’t want to beat this metaphor into the ground—um, probably too late there—but I want us to recognize and own the things we ourselves do that cause harm to others in the system. I want us to believe that maybe it doesn’t have to be this way. I want us to affirm today that scholarly communication is our system and we can change it for the better—maybe we can go vegetarian, as it were. Borrowing a line from Disney and blessing fair use for my ability to do that: Fish are friends, not food!

In other words, no honey badgers here today, okay? For those not up on their internet memes, honey badgers are incredibly predatory, and they famously don’t care about anything or anybody. Honey badger don’t care! Come on, I’m from Madison, badgers were inevitable. But no, no honey badgers here; Loyola really does care. I took this straight off your website, the mission page: “a profound commitment to the poor”—including, I hope and believe, the information-poor—“and to issues of social responsibility and justice. Ignatius believed that it is imperative to act upon what is learned—to use our values in service to humanity.” Wow, I love this. I love it so much. It is beautiful and brilliant and very librarianly and I love it. I want to add this to it: anybody acting as a publishing predator has lost sight of social responsibility, justice, and service to humanity. I firmly believe this. And we’re all predators sometimes, right? I just said that. So we all have some work to do to reconnect our publication-related behaviors with this mission. Here’s a few examples of what I mean.

Here’s predatory behavior for you. Want to read this article? Fork over 30 bucks to get past the paywall, buster. This one actually hit me personally. I wanted to read this article because it’s dead on point for a course I teach, but our library doesn’t subscribe to this journal. I didn’t want to read this article so badly that I was going to pay 30 bucks for it, much less make my students do so, so I just didn’t read it. Paywalls are predatory behavior! My students and I fell prey to this particular one. Chances are, most people in this room have fallen prey to other paywalls.

But I want to suggest also that the authors of this article, Marjit, Sharma, Sarkar, and Krishnamurthy, got turned into lunch by this paywall. They didn’t write this article for me not to read it! They don’t get anything out of keeping it away from me and my students! The authors are certainly not seeing any of that thirty bucks! They want to be paid in intangibles like reuse and citation and credit, but because I can’t actually read this, I can’t reuse it, and I won’t have any reason to credit or cite the authors, right? So the authors become lunch too. Every time the normal cycles of teaching and research and learning get eaten up like this, it makes me sad and angry. Maybe you too? It’s not okay.

On Twitter there’s this little acronym, “tl;dr,” that stands for “too long, didn’t read.” I’ve seen a couple of knockoffs of that acronym calling out paywalls. This one, “bp;dr,” stands for “behind paywall; didn’t read.” I’ve also seen “pw;dr” for just “paywall, didn’t read.” Every time somebody doesn’t read something because of a paywall, there’s predatory behavior somewhere, somebody’s becoming somebody else’s information lunch. And speaking of Twitter, the #icanhazpdf hashtag is also a thing, and in my book it’s just more evidence of how broken everything is. However you want to abbreviate or hashtag it, this is not okay. Nobody wins when people who care about our research get turned away by a paywall! And this happens all the time. It happens to us here in academe. It happens to our students. It happens to K-12 educators and their students. It happens to public librarians trying to guide people to good information. It happens to practitioners out in the world trying to apply research to real-world problems, it happens to policymakers, citizen scientists and citizen humanists, grant agencies and grant reviewers, journalists… all kinds of people with all kinds of legitimate interest in reading research!

And it shouldn’t happen. This is not service to humanity, this is not care for the poor, this is not social responsibility and it’s not justice.

An objection I sometimes hear from faculty to this line of thinking is that hey, they’ve got access to whatever they need, so obviously nobody’s being predatory, the system works, it’s all good. That’s just the most predatory thing I’ve ever heard! It’s basically “forget you, I got mine!” and it’s not an attitude I expect here at Loyola. And a thing I’ll ask you to do, because you are more enlightened than this, is to please challenge this when you hear it from your colleagues at other institutions. It’s just so predatory and so wrong.

Now, the way we supposedly get past the paywalls is through libraries, right? Libraries pay for access, publishers get paid for access, everybody gets access, everybody loves the publishers and librarians who make it possible, everybody wins.

Really? Everybody wins? Does everybody in this room feel like they’re a winner under this system? Seriously, I’m a librarian and I don’t feel like I’m winning! Look, let me just lay it out there, libraries are not the fix here, and partly that’s our own fault. Even those who can get into an academic library in the first place—and let’s not forget that’s not everybody—even a ginormous library system like the one where I work doesn’t have access to everything everybody needs. Am I right? So “library as access solution” is just not working, if it ever did, and it’s not going to start working again any time soon, sorry. It’s not that we librarians aren’t trying; we totally are. It’s that the whole game is rigged against us, and we need help and support to fix it. In other words, libraries are lunch. We have been lunch for the last 30 or 40 years, we’re still lunch, and if we’re going to stop being lunch without becoming predators, we have to change too.

Faculty, we need your understanding and your help to do that. When you blaze into your librarian’s office yelling about the latest journal cancellations, you’re being predatory. Instead, ask for the price, the usage numbers, and our cancellation rationale. We’ll show you. Librarians, when you just cave in to the faculty member who blazed into your office yelling about the latest cancellations, you are being predatory. Who are you eating? Other campus denizens who need materials that your library can’t buy because you’re catering to the people who yell the loudest. It’s not fair and it’s not okay, and all it does in the long run is get more people yelling at us because we’ve taught them that’s how to get what they want.

Here’s what we’re up against. Library budgets are flat as a pancake most places, and even where they’re rising, they’re not rising as fast as regular inflation, never mind actual journal-price inflation, which is leaps and bounds higher than general inflation. So libraries have been steadily losing purchasing power, and all the money in the world wouldn’t actually catch us up. At the same time, the scholarly literature is only getting bigger! And explaining what happened here would take me more time than I have today, so I won’t—trust me, I could go on for hours; I teach a summer course where I actually do!

I will say that we have all been predators. We’ve all played the honey badger and decided not to care. We have all let this predatory system persist. Librarians signed on to Big Deals that were only ever going to blow up in everybody’s face. Faculty steadfastly refused to pay attention to what was going on except to blame librarians. Administrators wrongly thought that just giving libraries more money would solve the paywall problem. Publishers and aggregators have been playing some seriously ugly and cynical games with non-disclosure agreements and pricing. It’s just fail all around. And here’s the kicker—no, really, this is unbelievable but it’s true, people have done the math here: the money already in the system is plenty enough to do the actual work of publishing the literature. That money is just disappearing into the wrong pockets, not to mention being misspent to prop up paywalls.

Part of the problem is that reward schemes in journal-based academe don’t reward some things that they should. They don’t reward openness, and they don’t reward evidence that people actually read or used or cited your specific stuff. They reward publishing in journals with high Journal Impact Factors, and there are a million reasons Journal Impact Factor is total predation, but one of them is that high impact-factor journals tend to be paywalled. If you reward impact factor, you’re rewarding predatory paywalls. That’s changing, thank goodness, but there’s another reason to walk away from impact factor, too. Who falls prey to impact factor? Who gets eaten for lunch? Younger scholars pre-tenure, that’s who. The stupid pointless inaccurate impact-factor chase destroys promising research careers. That’s evil. And it’s entirely within academe’s control; you can’t blame this one on librarians or publishers.

Oh, except when librarians act exactly like academe—and yes, I happen to know there’s at least one Chicagoland academic library that relies on impact-factor based “prestige” for judging their librarians for tenure. You know who you are. Shame on you. Walk out of here determined to set a better example, please. Look, this goes for everybody, if your tenure and promotion guidelines even mention Journal Impact Factor, if your department uses it to assess your people and their work, I want you to walk out of here today vowing to change that, because it’s predatory in and of itself, and it feeds into other kinds of predation. If you rely on Journal Impact Factor, you are the predator.

It’s these predatory priorities in how we assess scholars that lead pretty directly to a lot of gross scammy behaviors in the journal market. Let me be totally clear here: so-called journals trying to scam author-side fees are definitely a thing, but they are not common and not hard to avoid if you’re paying attention:

  • Did they spam your email? Walk away.
  • Does the home page brag about Ulrich’s or Google Scholar? Walk away.
  • Catch them in a lie? Walk away.
  • Not in doaj.org’s list? Walk away.
  • Walk faster if it’s on DOAJ’s list of journals that lie about being DOAJ members.

They’re also not the only predatory scam artists out there. We have drug companies buying up journals to fool clinicians into prescribing particular drugs. We have coercive citation, “cite our journals or we won’t publish you,” which is totally predatory and wrong. We have vanity one-author toll-access journals, I kid you not, and it’s amazing how long they last before anybody does anything about them. And we have giant multinational corporations making 30% to 40% profit margins—profit! that’s pure profit!—telling universities and libraries with limited and still-decreasing budgets “hey, how about another giant price increase this year?” Now that’s predatory.

That article that came out, something like 80 million dollars supposedly heading to scam open-access journals? Trust me, I’d love to see the scammers gone too, but 80 million is a drop in the ocean, folks. It’s a rounding error against the multinational journal publishers’ profit margins. Be outraged! This is outrageous! But train your outrage where it’ll do the most good, okay?

Here’s what a system that doesn’t chow down on potential readers looks like: you find a journal article on the web that looks interesting, you download it or just read it right on the page, end of story. It should be that simple, right? Because who needs the hassle of anything else?

I encourage everyone to take a look at the so-called “altmetrics” movement, which is expanding our understanding of scholarly impact well beyond Journal Impact Factor, in ways that are respectful of the power of open access. Here’s an example, from one of my own articles published a couple of years ago in an open-access journal, and just as a disclaimer I’m now on this journal’s editorial board. You can’t read the text I’ve circled here, it’s too small, so I’ll just tell you that this article went wild on Twitter, over 200 tweets. Who saw them? Well, just adding up the number of followers on the tweeting accounts, you get over 312,000 followers. Now, of course the real number isn’t anywhere near that high because of Twitter follower overlap, and of course not everybody who sees a tweet is going to click through and read the article. But stick with me for a thought experiment here: if one-tenth of one percent of that upper bound, those 300,000 followers, found the article this way and actually read it, that’s over 300 readers! Let’s get wild here, if it’s one whole entire percent? That’s 3000 readers! Another article I wrote back in 2008 I put in the institutional repository I was running at the time, so I had access to its download stats. When I left to join the iSchool four years ago, it had something like 14,000 pageviews. Think this kind of math could make a difference to a tenure and promotion committee? I think it might. I surely think it ought to.

The numbers are kind of a sideline, though. What makes me happy about this is that I know for certain that anybody who saw a tweet about my article and was curious enough to click through could immediately download and read it. No paywalls, no nonsense, nothing in the way, exactly as it should be. Now me, I’m lucky enough not to be tenure-track, so I have the luxury of being in the publishing game purely to make a difference. Open access helps me do that. Paywalls would just eat my readership for lunch.

Now, not everybody is going to care about Twitter specifically. That’s okay! Be critical of what’s being measured and what it means. That’s awesome and you should do that; everybody else sure is. Altmetrics is way more than Twitter, fortunately. Just please, make sure you’re turning the same critical eye on the very-possibly-predatory metrics you’re already using, like impact factor. Don’t let that stuff escape scrutiny.

So, on another topic, does anyone recognize this young man? He’s Jack Andraka, who while he was in high school designed a new inexpensive cancer test based almost entirely on open-access research literature. Now, here’s the thing. Do you know who the next Jack Andraka is, in your discipline? Because it’s an educator’s dream, finding and influencing a brilliant new talent, right? Do you know where the next Jack Andraka goes to school? Do you know what journals the next Jack Andraka’s school library has access to? Well, yes, you probably do, it’s coterminous with “open access journals” because be real, high school libraries can’t afford journal paywalls. Do you even know what country the next Jack Andraka will come from? Odds are good it won’t even be the US! And there are lots of countries with, I must and do believe, lots of Jack Andrakas in them, whose access to the journal literature is unbelievably worse than it is here.

I don’t know who or where the next Jack Andraka is either. I also don’t know how many potential Jack Andrakas can’t become Jack Andraka because their research, their inquiry, their curiosity, their drive, gets eaten alive by predatory paywalls. So in a very real way, we don’t get more Jack Andraka-style wins until there’s a lot more open access. I think we as a society want those wins, I really do.

Anyone recognize this young man? His name was Aaron Swartz, and he tragically took his own life after becoming the target of federal prosecution over an attempted mass download of the JSTOR journal database. Now, I don’t want to say that lack of open access is solely or uniquely at fault for how the system chewed up and spat out Aaron Swartz. That would be ludicrously overstating the case. The social and technical and legal structures we’ve built up around scholarly paywalls did contribute, however, and all of us in academe—faculty, publishers, librarians, content vendors—need to recognize and own that part of the fail. I would like this never to happen again. With open access, I sure hope it wouldn’t need to.

Because this kind of predation is still happening, now, today—does anyone recognize this man? His name is Diego Gómez, he’s a dissertator, and he is facing a ruinous lawsuit in Colombia because he scanned and posted a dissertation online that was so important to his work he thought everyone should be able to read it. Now, Colombian copyright law is out of control, it’s even worse than ours, which is bad enough. But even beyond that, Diego Gómez would not even be in trouble in a world of open access! He wouldn’t have to post a PDF of a thesis that wasn’t his, he could just link to it! And again, isn’t it supposed to be that easy to acknowledge and credit our good influences?

In passing, it’s interesting to me that this is happening to Gómez over a dissertation specifically, because it’s exactly with dissertations that a lot of universities and a lot of young scholars are experimenting with open access. And the world hasn’t come to an end, the sky has not fallen, nobody’s been eaten by sharks that I’m aware of. If anything, what I usually hear is great stories about young scholars making connections with established scholars over an open dissertation, open dissertations leading to unexpected-but-welcome publishing deals, undergrads and new grad students understanding the research process better because of open dissertations—I couldn’t ask for a better illustration of how amazingly useful and helpful open is.

So here we are. We want more open access because we want a system that isn’t turning willing readers and willing writers into lunch. So… why aren’t we there yet?

It boils down to what economists call—say it with me if you know it—a collective action problem. Basically, there are lots of stakeholders who want to see a less lunchy system. Students. Teachers. Researchers. Librarians. Policymakers and lawmakers. Research funders. Even some publishers, though not all. But no stakeholder can do it all alone, just too many moving parts and way too much inertia in this system for that! So it’s really easy for everybody to just sit back and say, hey, vegetarianism would be great, no more predators would be great, but there’s nothing I can do myself that will make a difference, so let somebody else do something. I won’t say nobody does anything, because we wouldn’t even be here today if a lot of people weren’t doing a lot of things, but a lot of people do nothing, so nobody can actually do enough.

So now what? How do we stop the predation? How do we push past the collective action problem? Well, that’s what Open Access Week is about, really—and when Loyola’s mission says that Ignatius believed in action, taking action to open access to more research is what it’s about.

What it amounts to is, there are a million excuses in the naked academy for not pursuing open access, and I have heard them all. You cannot surprise me; I’ve been doing this for a decade.

  • That’ll never work in my discipline!
  • But tenure! But promotion!
  • Oh, just give the library more money so they’ll shut up.
  • Who’s paying for this anyway?
  • I don’t know what a repository is, but I’m pretty sure we don’t have one.
  • But that’s work, and I’m too busy.
  • Well, my work isn’t really ready yet…
  • OMG copyright!
  • I heard it’s all a scam.

Maybe that’s how you feel. Maybe you are totally willing to make excuses to me when I can’t get to an article I want to read and maybe use in my teaching. Maybe you’re okay making excuses to Jack Andraka, or Diego Gómez! But I’m not, and I don’t think anybody in the academy should be. That is not what we are about here, especially this week. We’re about action.

There’s lots of ways to take action! You can take action as an author, as an editor or reviewer, as a good departmental citizen or a good university citizen, or as a good citizen generally, out there in the world of politics and advocacy. I’m going to make a couple-three specific suggestions, but you don’t have to follow them; they’re just ideas! Got a better way? Go for it! No matter what anybody tells you—and yes, I know there’s a lot of zealots out there and I wish there weren’t too—as long as you’re legal, there’s no wrong way to be open.

No matter where you are or what you do at Loyola, there’s something you can do. At base, there’s this: you can learn and you can teach. Because the collective action problem is real. Nobody can flip the switch to open alone; if it was that easy we’d have done it already. So make sure nobody’s alone! Let’s do this together.

Back in Madison over the summer this year, some folks started up a monthly open meetup—low-stakes, low-commitment, just interested people talking and teaching and learning from one another. It’s not a huge group. Yet. I mean, it hasn’t even been six months yet, these things take time! But it’s the right people gathering for the right reasons and I love that; I’m super-happy about this. (I’m allowed to say that. I didn’t organize this; I just go to the meetings and help out with the Twitter account.)

Could you do this here? Of course you could, and you’re so lucky to be in Chicago, because it’s dead easy to build a critical mass here! Just off the top of my head there’s a half-dozen people right here in Chicago who know their way around this stuff and are great facilitators. So go to it, monthly meetup, hopefully over deep-dish pizza because this is Chicago, right? (In Wisconsin it’s about cheese.) You can do this!

You don’t have to stop at, or even start from, open access to the journal literature! Maybe you’re about open licensing, making sure the scary wing of the copyright-holder party doesn’t take over the entire cultural and scholarly universe, hello there, Trans-Pacific Partnership. If that’s you, I am with you! We need this kind of legal infrastructure to make open access work.

Or maybe you publish books rather than articles. I won’t lie: open access to books is a long way behind journal articles, because the economic structures are different and so the solutions have to be too, to some extent. But it’s starting to happen; we’re starting to figure out how to pay for it—I’m serious, just in the last month two or three major reports have come out about new models for sustaining monographs—and you can be part of that! I hope you’ll consider it.

Oh, and while I’m on the topic of the humanities, I hear two common excuses for why open-access humanities journals haven’t been happening. One of them is “our journals aren’t expensive! why should we have to change?” And here’s the thing about that: you’re right. You’re totally right; your journal prices are rounding errors. But that doesn’t get you off the predatory hook, sorry, because the big multinationals publishing science journals and making obscene profits are hiding behind you, claiming they’re just the same as you. They’re also eating you for lunch, of course, because the money libraries are paying to the big multinationals isn’t paying for books, but seriously, retire this excuse, okay? You’re just helping to rationalize a predatory system that’s eating you for lunch. Come on, I know you’re smarter than that.

The other excuse is “we don’t have big grants, how will we pay author-side fees?” And that’s based on the weird idea that author-side fees are the only way to pay for open-access journals. That’s not true. That’s never been true. I honestly don’t know why anybody believes this! The Open Library of the Humanities just launched with seven journals (soon, I am told, to expand), and it’s tremendously exciting. If you edit a humanities journal, if you’re thinking about starting one, give OLH a look. It’s open access designed for the humanities, without author-side fees.

Librarians in this room: I am bitterly ashamed to say that I checked the OLH supporter list and I didn’t see one single Chicago-area institution on it. Come on, y’all, if we don’t want to be predatory free-riders, if we’re serious about open access in general and open access to the humanities in particular, this is the best way I know to put our money where our mouths are. When you leave this room—or even before, I don’t mind if you use your mobile tech while I’m talking—make an appointment to talk with whoever does collection development in the humanities in your library. Let’s get some Chicago dollars floating toward OLH and Knowledge Unlatched and projects like that.

Faculty, you can do the same thing! Don’t yell—yelling is predatory and not cool—but suggest to your librarians that these are places you want to see library dollars going. The amazing Leslie Chan of Toronto—great human being, read up on him if you don’t know him—Leslie once suggested what he calls the “one percent solution.” And you can probably guess what that means just from the name, right? Take one measly percent of the library’s collections budget and toss it toward open access; it’ll make a greater difference than you can imagine. So take that home and think it over, too. One percent. Just one percent.

If you’re into open textbooks, open educational resources, we are totally friends, you and I! I am all in favor of keeping Pearson from devouring the universe, because frankly Pearson scares me. This here is the Open Textbook Library from Minnesota, and they’re looking for textbook reviewers from all over, so pitch in! And if an open textbook can work for you, please adopt one; this and OpenStax are great places to start looking.

Or maybe your thing is open data. We are still friends! Open data is my thing too!

Or maybe, librarians, archivists, and humanists, you’re interested in opening up the amazing wealth of unique material in local collections through digitization and open metadata. Great! I am with you! I’m a Community Representative for the Digital Public Library of America, which is trying to do just what I just said, so ask me anything about it after we’re done here. I think this opportunity is amazing, and I’m so pleased that so many libraries and archives and museums and personal collectors are getting involved! So there’s lots of ways to open up more wins. Please learn about them, this week and beyond. I think you’ll find a niche you’ll like.

For now, though, circling back around to the scholarly-journal literature…

If you create academic things? I’m asking you to make one of those things open this week. Just one. If you can do that with a journal article you’ve written, great! Legally, please—if you need help figuring out where you are with copyright, the librarians here or wherever you are can help you. But look, it doesn’t have to be a journal article, it can be anything academic you’ve made and hold rights to that would help other people. Normally I’d tell you I will pledge this alongside you, but I have a little bit of a problem there, namely that everything I’ve ever published that I hold rights to is already open, one way or another, and a whole lot of other stuff I’ve done besides.

Come join me in making work open. I’ve done some regrettable things during my career, but there’s one thing I have never regretted for an instant, and that’s making my work open. It has opened so many doors, created so many opportunities for me, introduced me to so many amazing people, some of whom have said to me, “I only got to read your article because it was open access, and it really helped me, so thank you.” What’s to regret about that, right? So join me. Jump in, the water’s fine and there are no sharks in it! Let open work for you too.

Now, you need to put your work someplace where it’ll stay there, okay? Not just any old web server—anything can happen to them and usually does—but someplace that somebody beyond you has a commitment to and will take serious care of. Time is the great devourer, especially of all things digital, and we’ll all retire someday, right?

Here is your library’s commitment to making your work open for the long haul, the long now, tomorrow and tomorrow and on into the future. This is Loyola University Chicago’s eCommons, and it’s run by your librarians here at Loyola. So I’m asking you to join your librarians in their commitment to making Loyola a more open place. To that end, here’s that goal for you again: by the end of this week, I want everybody in this room, every single one of you, to add your one thing, whatever it is, to Loyola eCommons. Can I get the Loyola librarians in this room to raise their hands? Okay, thanks. These are the folks you go to for help putting in your one thing.

One more thing. I’d like each of you to sign up one other person at Loyola this week. Just one. And convince that person to put one piece of work in. Just one! Because each one teaching one, that is how we get past the collective action problem.

As a department, I need you to fix your tenure and promotion practices, especially if they depend on Journal Impact Factor, which is utterly predatory. And humanities folks, if you’re still so fixated on print books that you can’t even imagine worthwhile scholarship in any other form, I don’t even know what to say to you except stop that, it’s 2015, come on! The Modern Language Association has guidelines for you; it’s time to implement them.

As an institution, it’s Loyola’s job to dream big, set the big goals. Here’s the one I’d suggest, just for starters: “Loyola wants no one interested in Loyola research to be turned away by a paywall.” Then figure out how to get there! I know it’s tempting to just be reactive, or even to resent all this because you don’t want it imposed on you. I’m asking you today to be bigger people and a better university than that. Own open access. Own it yourselves, as an institution. Adopt this goal, or one like it, and own it, and make it happen.

That will take a lot of talking and a lot of work—that’s just life in the academy—but others have walked this road before you, and plenty of them are happy to walk alongside you, because we all win when we’re not being eaten alive by paywalls. I believe Loyola can do this. This is not something I say about every institution—I’ve been doing this for ten years; it’s been hard and I’m incredibly cynical, sorry. But Loyola has the right mission and the right people and the right preparation. I look forward to seeing what you accomplish!

Thanks again for inviting me, and I’m happy to answer questions!


Who’s Devouring Whom? Introduction

Open access to the scholarly literature stopped being my job in 2011. I have yet to escape Open Access Week, however. A couple of years ago I had to tell some of our local students gently that no, I wouldn’t do yet another Open Access Week talk at UW-Madison; it made no sense for them or for me, since everyone there who ever wanted to hear what I have to say about open access already has.

I am still a useful second-string speaker at institutions that (for whatever reason) don’t spring for one of the top names, it seems. In contrast to my usual modus operandi, I don’t write a new talk every year for Open Access Week; instead, I have a mostly-stock talk that I adapt to the institutional context where I’m going.

Loyola University librarians told me that so-called “predatory publishing” was a concern among Loyola faculty and administrators. They also hoped for fewer electronic thesis embargoes and more (and more willing) participation in their institutional repository. I told them I’d try to speak to those issues, and I did try.


Quia Faciendum Est

Thanks for that gracious introduction, I very much appreciate it! I’m Dorothea Salo, and I am not nor have I ever been a cataloger, though I may be one of the last generation of library-school students required to take a full course in cataloging, I don’t know. I just know we at the iSchool are doing a major curriculum revision, and current odds are that we’re keeping the core organization-of-information course, just changing it to de-emphasize MARC cataloging in order to include more material relevant to non-MARC environments. Interesting times, not that they’re ever not.

Curriculum revisions are never fun. They eat everybody’s time for months on end, they always cause bureaucratic hassles out the wazoo, and for a while you have to deal with two different incompatible curricula and remember which rules apply to which student you’re advising… it’s a mess. Why do we do this to ourselves? Why do we bother, if it’s such an awful hassle?

My answer is quia faciendum est, “because it must be done.” (Just in Latin it sounds cooler.) We at the iSchool can’t just sit back and do what we’ve always done because we’ve always done it that way, not when the world our graduates will need to fit into is way different from what it was when we built the old curriculum. It doesn’t mean we did a bad job on the old curriculum; I don’t think we did! It’s just that the world has changed out from under it. So we have to change the curriculum. We don’t have to enjoy it. We just have to do it, because it must be done.

Not coincidentally, that’s pretty much how I feel about the move away from MARC. It feels to me like a lot of the library profession has spent half a decade now, at least, on the question cur faciendum est? or “why must it be done?” And even though this question sounds really cool in Latin, I have completely run out of patience with it. Don’t even tell me nobody’s asking this still. I straight-up heard it just last May at a conference; it’s totally still out there floating in the water.

You like water? Here’s some water in a nice harbor, really pretty, love the lighthouse, but the thing that isn’t in this picture of a pretty harbor is a ship. Because the “why can’t we still use MARC?” ship has sailed, people! It has sailed. I’m not even having the why-do-we-have-to-change discussion today, I honestly don’t see the point, that ship has sailed.

I’m next year’s program planner for the IT Division of Special Libraries Association, and I was talking to SLA’s technical-services planner Betty Landesmann—some of you probably know her—I was talking to her about a linked-data session, and she rolled her eyes at me and said “can we not do another intro to linked data and why it’s better than MARC please? I’ve seen a ton of those and they don’t help.” Okay, then. If I’ve got catalogers yelling at me not to do this, I won’t do it! Because really, the answer is exactly the same as it is for our curriculum revisions at the iSchool. Quia faciendum est. Because it must be done.

I do want to mention, though, because I come at technical services from an XML-and-metadata background rather than a MARC background, that it isn’t just MARC cataloging the bell is tolling for here. There’s zero chance XML-based metadata practice will stay the way it is today; I already see it changing. I’m not even sure XML will stay alive as a pure metadata format, as opposed to uses like TEI for the digital humanities and EAD for archives, where you’re dealing with narrative-type documents intended mostly for human beings. I’m okay with XML’s decline as a metadata serialization. I never liked my nice elegant document standard getting worked over by the data engineers anyway—do not even talk to me about XML Schema; it’s just horrific. Maybe now I can have XML back. For documents. As it should be.

No, I’m much more interested in this question: quid faciendum est; what actually is it that must be done? What do we have to do to our catalog data and metadata so it works in this world where so much has changed about how we find information? I like this question because it’s pragmatic, I like it because it’s intriguingly complicated, I like it because it’s nerdy in all the best ways, I like it because I am an inveterate fiddler-with-things and there’s just great huge masses of MARC and XML right there to be fiddled with… and it’s another of those questions we have to work on or we just stay stuck, right?

I don’t think it’s enough to just say “well, we have to migrate our data from MARC and MODS and METS and the various Cores—Dublin Core, Darwin Core, VRA Core, PBCore and so on—we have to migrate all that to linked data.” That’s skipping all the steps! That’s like saying “we have to pick up some rocks and turn them into a giant mosaic.” Whoa, wait, not enough information! What’s our mosaic design? Where will the mosaic be built? Where do we find the right-colored rocks, and how many rocks of each color do we need, and how do we cut them down if they’re too big or not the right shape? How do we glue the rocks down? What if somebody makes a mistake? What if there’s an earthquake?

Process. That’s what we need here. Some process, right?

The other reason I don’t think it’s enough to just say “well, we have to migrate our data from what we have to linked data” is that it assumes without proof that linked data is the ultimate destination for it. It’s the horse to bet on, I’m not saying it isn’t, but I just think linked data gets used as a stalking horse sometimes, a scapegoat. It’s linked data’s fault we can’t use MARC, it’s linked data’s fault all these changes are happening, if it weren’t for linked data supposedly being the new hotness we could stay the way we are and everything would be fine.

I don’t actually think that’s true. If linked data didn’t exist—and let me tell you, I have lots of days I’d love to wipe RDF off the face of the earth—if there were no linked data, we’d still have to make changes in how we collect and organize our catalog data and our metadata. We have to make those changes for the same reason we’re changing the LIS curriculum at the iSchool: the world has just plain changed out from under the old ways. That didn’t happen when the Library of Congress or the British Library announced their linked-data plans. It happened long before that. It happened when paper cards gave way to the Web as the main way patrons interact with library catalogs. It’s not that we didn’t notice; of course we did. It’s just taken us a while to figure out what we need to do about it. I don’t know how to say “oops” in Latin, but oops, I do think we maybe waited longer than we should have. Water under the bridge now.

That leads me right back to quid faciendum est? What do we have to do, now that the work we do has to play nicely with computers? And not just computers—MARC was designed for computers—but networked computers, computers that can talk to one another. The network really does change the game.

When I teach our core organization-of-information course, I start my students with this thought: ordinatra stulta sunt. Computers are not all that bright, I tell my students; you’re way smarter than a computer. I do this for a lot of reasons. One reason is knocking computers off pedestals—not actually literally knocking computers off pedestals, though hey, that would be kind of awesome—but you know what I mean, right? A lot of my students come into the iSchool thinking that computers are like unto small gods, magical and capricious and liable to mess you up, impossible to understand much less work with. I’m saying, I have to get them to not think that, because the more they understand about how computers do work, the better off they are, and the better off we all are.

But the main reason that the notion that computers aren’t real bright is relevant to this talk today is that from our point of view as literate human beings, computers are not-too-bright in some very specific and fairly easy-to-understand ways, and those ways tell us pretty clearly what our catalog data and metadata have to look like if we want computers to work effectively with it. Really this is no different from how the shape and size of catalog cards and the standard size of typewritten lettering shaped how the MARC record had to look. The technology you have available—and the card catalog is totally a technology, don’t let anybody tell you it isn’t—the technologies available to you shape how it makes the most sense to do things, because different technologies are good and bad at different things and need different things to function best. That’s basic design theory; read Donald Norman’s Design of Everyday Things and its sequels if you haven’t already, it’s brilliant stuff.

So the first thing to remember about computers is that text, the ordinary stuff we write for other people to read—the text we literate human beings read and comprehend so fast and easily that we hardly have to think about it—it’s all Greek to a computer. (You knew that cliché was showing up at some point, right?) Computers can’t read; they are functionally illiterate. If anybody in this room has a kindergarten-aged child at home, that kindergartener most likely reads and comprehends text immensely better than a computer can.

So, in my head, one step toward coping with illiterate computers is dealing with our addiction to textual notes. By way of example, I took some MARC 504s straight from the Library of Congress’s MARC documentation—thanks for that, by the way, if anybody here is responsible for it, it’s super-helpful in my classroom:

  • 504 ##$aBibliography: p. 238-239.
  • 504 ##$aSources: p. 125-152.
  • 504 ##$aIncludes bibliographical references.
  • 504 ##$aIncludes bibliographies and index.
  • 504 ##$aChronological list of the author’s works: p. 469-475.
  • 504 ##$a"Selected bibliography": v. 1, p. 351-358, v. 2, p. 234-236.
  • 504 ##$a"Spisok izdannikh knig": p. 277.

So, for example, suppose a patron question we would like our catalogs to answer is “hey, I’m new to this topic, can I get a recent book with a good bibliography please?” Because these notes are free text, our catalogs can’t answer that question. Getting to an answer means filtering a list of books by whether they have a bibliography or not, and to do that with MARC notes, a computer has to understand that “bibliography” and “bibliographies” and “bibliographical” and “literature cited” and “sources” and maybe whatever that romanized Russian means—I didn’t look it up—all mean yes, there’s some kind of bibliography! And no, a computer can’t just look for the existence of a 504 and assume there’s a bibliography, because some 504s don’t say anything about bibliographies, just indexes.

Look. The computer is not bright enough to figure this out. It can’t read, much less read all the languages we transcribe stuff in, much less comprehend what it reads. That makes a lot of the stuff in our MARC records a lot less useful to patrons than it could be.

Computers do understand yes or no; that they’re real good at:

  • ☑ Bibliography
  • ☑ Index

Checkboxes are candy to computers. So for any conceivable criterion we want our patrons to be able to filter their catalog results on, we pretty much have to quit recording it in text and make it a checkbox—or radio buttons if there are more than two options; that works too. If you get the sense from this that I like MARC fixed fields, you’re right! (Though honestly, that festschrift thing? That is just weird.) If a lot more of MARC had been expressed in fixed fields instead of free text, we’d be a lot better off right now.

Now, let’s say for a moment that the titanic arguments we in the profession will have to have about when a book can be said to have a bibliography, and what counts as an index… let’s say those are over, and we’ve drawn the best line we can. We still have to deal with this giant horrible mass of free-text notes in our existing catalogs that computers aren’t bright enough to understand. Faciendum est, people, it’s got to be done. I don’t want to get down in the weeds on this—actually I would totally love to because this is exactly the kind of problem I salivate over, but I’m a giant nerd and not everyone is like me in that. So I just want to say, this class of problem can be solved for the great mass of our records without hand-editing, and of course it must be, because there ain’t enough staff time in the universe to check all those checkboxes. And it’s a thing that must be done for every characteristic in our notes that we want users to be able to filter or search on.
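Just to make the batch approach concrete without getting into the weeds, here’s the kind of sketch I’m talking about. The keyword list is purely illustrative; the real controlled list would come out of those profession-wide line-drawing arguments, and anything the pattern doesn’t recognize would go to a human-review pile rather than silently defaulting to “no”:

```python
import re

# Illustrative keyword list only; the real list comes out of the
# profession-wide argument about what counts as a bibliography.
BIB_PATTERN = re.compile(
    r"bibliograph|literature cited|sources|works cited",
    re.IGNORECASE,
)

def has_bibliography(notes_504):
    """Collapse a record's free-text 504 notes into one yes/no checkbox."""
    return any(BIB_PATTERN.search(note) for note in notes_504)

# Run over every record in the catalog, no hand-editing required:
has_bibliography(["Bibliography: p. 238-239."])         # True
has_bibliography(["Includes index."])                   # False
```

A dozen lines like this can check the checkbox for millions of records, which is exactly why there’s no excuse for doing it by hand.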

Another serious and ugly free-text problem we have in our records has to do with places where our content standards don’t force us to be consistent about how we record certain information. Quoth the mighty AACR2r, “You know, whatevs. We’re cool. Just type something.” On catalog cards this inconsistency didn’t matter, because the information was only ever going to be skimmed by a human being, who doesn’t need a whole lot of consistency. We’re literate humans; we’re smart about figuring this stuff out. Computers are astoundingly literal-minded, however. You can take some text, add a space to it, and to the computer it’s suddenly something totally different. Just one space, that to a human isn’t even visible!

Once again, this isn’t just a MARC thing. Consistency in Dublin Core metadata? Don’t make me laugh. Actually, I’m more likely to cry. It’s bad out there, people. I mean, I once got an entire published article out of one poor soul in the institutional repository I was running at the time—poor guy had stuff under eight slightly-different spellings of his name. Just terrible, and yes, I fixed it as soon as I had the screenshot I needed for the article—of course I did, I wouldn’t just leave it there in that state—but it goes to show.

I want to call out two Dublin Core things specifically, noting that you’ll find these problems many more places than just Dublin Core. Dates, oh my goodness, dates. Dates are really important to information-seekers, so it’s really important that we record them consistently such that a computer can reasonably intelligently filter based on them. We are not there yet; we’re not even close to there. Dublin Core, MARC, it doesn’t matter, people who try to make computers work are tearing out their hair about the ways we do dates, and we have to fix it. Faciendum est, we have to fix it.
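As one tiny illustration of what “fixing it” looks like in bulk, here’s a sketch that pulls a sortable four-digit year out of a few common messy forms. Real date cleanup (the Library of Congress’s Extended Date/Time Format, date ranges, “n.d.”, multiple calendars) is far hairier than this, so treat it as a cartoon:

```python
import re

def normalize_year(raw):
    """Pull a filterable four-digit year out of a messy free-text date,
    or return None so the record lands in the human-review pile."""
    match = re.search(r"(?<!\d)(\d{4})(?!\d)", raw)
    return match.group(1) if match else None

normalize_year("c1987.")    # "1987"
normalize_year("[2006?]")   # "2006"
normalize_year("19--?")     # None, needs a human
```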

Rights statements, my second callout, come from the digital-collections side more than the catalog side. They matter because we have to be clear about what our users are allowed to do with our digital collections, and to do that, our search engines have to be able to tell which users can do what with which items, and that’s all free text now and it’s a total mess. Europeana and the Digital Public Library of America are working on it, thankfully, though I don’t envy them that job one bit.

An example of the ugliness of free text that I use in class a lot is from library software developer Bill Dueber, who took a close look at what came after the ISBN in the 020 field in the catalog he was working with. It’s horrific. Even in just the top twenty responses by number of occurrences, you can see the inconsistency:

  • 402537 pbk
  • 387406 alk. paper
  • 99260 v # (e.g., “v. 1”, “v. 22”, etc.)
  • 82918 cloth
  • 51125 hbk
  • 42036 electronic bk
  • 41360 acid-free paper
  • 38792 hardcover
  • 28913 set
  • 20358 hardback
  • 19160 ebook
  • 16264 paper
  • 15269 u.s
  • 12770 hd.bd
  • 11793 print
  • 10625 lib. bdg
  • 10520 hc
  • 8772 est
  • 7767 pb
  • 7639 hard

The more you drill down, the worse it gets. Dueber looked for everything that might reasonably represent a hardcover book:

hard | hard adhesive | hard back | hard bd | hard book | hard bound | hard bound book | hard boundhard case | hard casehard copy | hard copy | hard copy set | hard cov | hard cover | hard covers | hard sewn | hard signed | hard-backhard-backcased | hard-bound | hard-cover | hard-cover acid-free | hardb | hard\cover | hardbach | hardback | hardback book | hardback cover | hardbackcased | hardbd | hardbk | hardbond | hardbook | hardboubd | hardbound | hardboundhardboundtion | hardc | hardcase | hardcopy | hardcopy publication | hardcov | hardcov er | hardcovcer | hardcove | hardcover | hardcover-alk. paper | hardcovercloth | hardcoverflexibound | hardcoverhardcoverwith cd | hardcoverr | hardcovers | hardcoversame | hardcoversame as above | hardcoverset | hardcovertion | hardcver | hardcvoer | hardcvr | harddback | harde | hardocover | hardover | hardpack | hardpaper | hardvocer | hardware | hd | hd bd | hd. bd | hd. bd. in slip case | hd. bd.in sl.cs | hd. bk | hd. cover | hd.bd | hd.bd. in box | hdb | hdbd | hdbk | hdbkb | hdbkhdbk | hdbnd | hdc | hdcvr | hdk | hdp | hdpk | hradback | hradcover | hrd | hrdbk | hrdcver | hrdcvr

So our catalogs can’t answer the very simple question “yo, this book, print or electronic or both?” At least not based on the 020—and yes, I know RDA fixes this and I’m pleased about that. Bottom line, though, a lot of catalog data is hopelessly internally inconsistent. Sometimes that’s material for patrons and sometimes it isn’t, but when it is, this is my call to control all the things. All of them. Anything useful in a record that isn’t actually transcribed off the item needs a controlled vocabulary, or other appropriately-standardized expression if it’s something like a date. I cannot with this nonsense, and neither can computers. “Whatevs, just type something” is not okay in 2015. Transcribe it or control it; there is no third option. Faciendum est. Oh, and since I’ve said the word “transcribe,” let me just say, intentionally transcribing typos and other errors in information that’s material to a patron’s searching and browsing is completely mindboggling to me. We have to fix that stuff, and stop propagating mistakes. Consider it a service to publishers, as well as our poor patrons.
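For flavor, here’s what “control it” might look like as a batch job over those 020 qualifiers. The variant table is a tiny illustrative sample, nothing like a complete mapping of Dueber’s list, and anything unmapped goes to a human:

```python
# Tiny illustrative sample of a controlled mapping; the real thing would
# need dozens more variants from Dueber's list, plus a human-review pile.
QUALIFIER_MAP = {
    "hbk": "hardcover", "hc": "hardcover", "hd.bd": "hardcover",
    "hardback": "hardcover", "hardcover": "hardcover", "cloth": "hardcover",
    "pbk": "paperback", "pb": "paperback", "paper": "paperback",
    "ebook": "electronic", "electronic bk": "electronic",
}

def normalize_qualifier(raw):
    """Map a free-text 020 qualifier onto a controlled term, or None."""
    key = raw.strip().strip("().").lower()
    return QUALIFIER_MAP.get(key)  # None means a human looks at it

normalize_qualifier("(Pbk.)")   # "paperback"
normalize_qualifier("hrdcvr")   # None
```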

I picked on the 020 field for another reason too, having less to do with cataloging practices and more to do with ISBNs. Now, I know I can’t fool y’all the way I fool my org-of-info students with the question “is the ISBN a good book identifier?” We know it’s not. We know lots of books don’t even have ISBNs, and sometimes ISBNs get repeated for different books. It’s not totally clear what a “book” even is in ISBNland; it’s kind of an edition but not really, it’s kind of a format question but not really, and it’s all very confusing. Perhaps predictably, it’s confusing to computers too. Computers need to be really super unambiguous when talking about what kind of thing something is—if you and the computer have a different definition of what a “book” is, the computer is going to do random unexpected and unpredictable things from your point of view. The computer is happy to use whatever definition or definitions we’re happy with, the computer doesn’t care… but in spite of FRBR and sometimes (it must be said) because of it, we don’t really have clear definitions here that don’t lead us into logical contradictions or bad edge cases.

So that’s one thing. We have to figure out what exactly we’re talking about when we say things like “book” and “ebook” and “hardback” and so on, so we can explain the distinctions clearly to the computer… and if this reminds you of Suzanne Briet trying to explain when an antelope is a document and when it isn’t, I am right there with you! It’s totally going to be weird and sometimes theoretical like that.

Once we know what kinds of things we’re talking about, we have to be able to point clearly and unambiguously at every single example of these things that we—we collectively—have, so that it’s easier to pool the information about these things that we collectively have. The network can work for us if we let it—all of us know more than any of us about what we all have—but to let the network work, we have to have a common way to point at things. And for a computer, that means an identifier that (unlike an authority string) never, ever changes. And for a networked computer, that means an identifier that’s unique not just in your organization—so no, your call number or barcode number won’t work—but unique worldwide, so it absolutely cannot be language-dependent.

We’ve tried collating our records by fuzzy-matching and deduplicating; that’s how metasearch worked. But we pretty much all know that metasearch never worked real great. Computers aren’t bright enough to fuzzy-match well, and catalog data and metadata are sparse enough that they’re not good candidates for that approach to begin with. We’ll still have to use it to assign identifiers to start with, because we don’t have anything better, but it’ll be a long haul for some stuff and some catalogs. So that means we need unique identifiers for our stuff that are way more reliable than ISBNs. If you know the linked data world at all, you know that the scheme that’s been settled on for these is URIs, which mostly look like URLs. The reasoning there is that we already know how to make URLs globally unique, because they already have to be or the web doesn’t work. That’s all it is, just keeping the computers from getting confused.

Another problem we have to solve if we’re going to take advantage of the network is taking identifiers and quasi-identifiers that we’ve been using for things, like ISBNs and authority strings, and matching them up with URIs/URLs that have been established for those things. Again, I’m not going down in the weeds here, but I do want you to know (if you don’t already) that the nerd-word for this process is “reconciliation” and it can be partially automated if you know your source data well, as catalogers generally do.
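A cartoon of the automatable part of reconciliation, with made-up URIs: a real pass would match against an authority-file dump from something like id.loc.gov, VIAF, or Wikidata, and route the ambiguous matches to humans.

```python
def normalize(heading):
    """Fold case and collapse punctuation and whitespace so near-identical
    authority strings land on the same lookup key."""
    return " ".join(heading.lower().replace(",", " ").replace(".", " ").split())

# Hypothetical lookup table; a real one is built from an authority-file
# dump, not typed in by hand. The URI is deliberately fake.
AUTHORITY_URIS = {
    normalize("Twain, Mark, 1835-1910"): "http://example.org/auth/1",
}

def reconcile(heading):
    """Return the URI for a heading if the normalized forms match."""
    return AUTHORITY_URIS.get(normalize(heading))

reconcile("Twain,  Mark, 1835-1910.")  # matches despite spacing and punctuation
```

Knowing your source data, as catalogers do, is what tells you which normalizations are safe and which ones will merge two different people.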

Once you have a URI for something, you can go out to the network and ask a whole bunch of trustworthy sources “hey, what do you know about this thing?” and get back useful answers. To me, that’s how what we now think of as copy cataloging is going to work. Ask tiny questions of various reliable sources, get tiny answers, build them up into search indexes and facets and browsing tools and all the other UI chrome we’re already familiar with. It won’t have to be done by hand, if you tell a computer “every time I feed you a URI for a book we’ve bought, ask this question of that source in this way and store the answer there” it will happily do that, reliably and consistently, every single time. I strongly believe this will be a much better solution to what I think of in capital letters as the Problem Of Vendor Records. You’re familiar with this problem, I don’t have to elaborate, right? What I’m saying is, the Problem of Vendor Records is nine times out of ten a problem of vendors struggling not only with MARC and AACR2, but with MARC practices that are incredibly inconsistent across libraries. It’ll be a lot easier for us and for vendors if instead our computers ask their computers a lot of tiny questions with tiny answers.
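Here’s the shape of that standing rule, with the network fetch stubbed out as canned data. A live version would dereference the URI over HTTP and parse whatever JSON-LD or RDF the source returns; every URI and field name below is made up for illustration.

```python
# Canned stand-in for a trustworthy remote source; a live version would
# dereference the URI over the network. URIs and fields are hypothetical.
CANNED_SOURCE = {
    "http://example.org/book/42": {
        "title": "A Hypothetical Book",
        "has_bibliography": True,
    },
}

def ask_source(uri):
    """One tiny question: 'hey, what do you know about this thing?'"""
    return CANNED_SOURCE.get(uri, {})

def index_new_purchase(uri, search_index):
    """The standing rule: every new URI gets asked about, and the tiny
    answers get stored where the search index can use them."""
    answer = ask_source(uri)
    search_index[uri] = {
        "title": answer.get("title"),
        "bibliography": answer.get("has_bibliography", False),
    }

index = {}
index_new_purchase("http://example.org/book/42", index)
```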

That leads to one last thing about the 020, okay? ISBNs aren’t unique, the field includes inconsistent format information, yeah, yeah, we got that. Here’s my question: what the everliving heck is format information doing in an ISBN field to begin with? Much less information about volumes of a multi-volume series? And if it does have to be there—and I know, I know, I do understand why it ended up there—why isn’t it at least in a separate subfield? What’s this nonsense with parentheses?

I’m showing you a mosaic detail here for a reason. You can see all the teensy-tiny individual rock bits here, and you can see how carefully they’re placed, and that none of them actually overlap anywhere. That’s what our catalog data and metadata should look like. No overlaps, nothing jammed together, everything in tiny tiny bits and each tiny bit in its own singular place, very carefully set apart from all the other tiny bits. That’s called “granularity” or “atomicity,” and computers love it. Computers are really really good at building up whole mosaics from tiny granular pieces! What they’re critically bad at is taking a whole and breaking it into parts. We have to do that for them. And as we saw with the 020 field, we often don’t, or when we do, we do it in ways that the computer finds confusing and inconsistent. Here’s a relevant factoid to take home with you: Computers cannot reliably and consistently split human names and titles in human languages into their component parts. Naming is just too inconsistent across human cultures and languages for that to work. With pre-split names, though, it’s relatively easy to write rules to put the names back together intelligibly.
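That asymmetry is easy to show: with pre-split parts, reassembly is a one-line rule, while the reverse direction, correctly splitting an arbitrary name string, has no reliable rule at all. A sketch:

```python
def display_name(parts):
    """Reassemble 'Family, Given' from granular name parts. Going the
    other way (splitting an arbitrary string) has no reliable rule."""
    family = parts.get("family", "")
    given = parts.get("given", "")
    return f"{family}, {given}" if family and given else family or given

display_name({"family": "Salo", "given": "Dorothea"})  # "Salo, Dorothea"
display_name({"family": "Sting"})                      # "Sting"
```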

So I grabbed a couple of examples from IFLA’s ISBD supplement—and again, thank you for this, if anybody’s here from IFLA, this is a classroom life-saver—just to show the difficulties:

Object (cartographic ; visual) : unmediated

Scanglobe diameter 30 cm / by Replogle. — Scale
1:42,000,000. — [Chicago, Illinois] : Replogle
Globes, [2006?]

1 globe : col., plastic ; 30 cm (diam.)

Relief shown by shading and spot heights. Depths shown by shading and soundings. — “Scanglobe” is a trademark. — Mounted on spindle crowned by a plastic clockface, in a plastic meridian half circle, on plastic base. — Globe lights up from inside by means of electrical cord with switch button and interior bulb

Try to think like a computer for a second. There’s a whole lot of punctuation all up in here, and it’s not at all obvious (even to me, and I’m a human being) what’s being set off by it or what it means, or even if it means anything at all! I mean, look at the physical description of the globe there. Tell me what a period, a dot, means there. As a human being, I can figure out the dots there are calling out abbreviations. Now, can I just tell the computer to assume a dot always means an abbreviation? Of course I can’t! A dot doesn’t mean that in the other areas! And can anybody tell me why in the last area of the globe description, everything except the final sentence has a period at the end? It’s enough to make a computer programmer cry into her beer.

I lopped the area labels off for legibility, but I just want to point out, we have two competing sets of delimiters happening here—the areas, that are set off with whitespace, and what’s in each individual area, which is funky punctuation city. And when you add that to MARC, we’ve got a whole ’nother set of delimiters in the form of fields, subfields, and indicators. I respect my colleagues who teach cataloging! I could never do it! Because I cut my teeth on XML, where delimiters are totally cut-and-dried and straightforward, such that I find this mishmash completely bewildering!

A lot of my students find learning HTML and XML frustrating, because they’ve never had to be one hundred percent consistent about delimiters before. So they make tiny little mistakes like leaving off an angle bracket, and they haven’t learned to scan for those mistakes yet, and they don’t understand what the validator that does notice problems like that is trying to tell them, and it’s really frustrating for them. What I tell them is, suck it up and deal. Okay, no, I’m not actually that evil about it. I’m pretty careful to point out the kinds of errors that beginners usually make, and I tell them that everybody makes those errors, even really skilled and experienced people, and it’s okay, the whole point of validators is to help us get it right. But fundamentally, they have to learn to deal, and so do we. They don’t like that necessarily, and we don’t have to like it either, but too bad, faciendum est, it must be done. Don’t confuse the computer, folks! Reliable and consistent delimiter use is how we avoid confusing the computer. Delimit in just one way, delimit clearly, delimit unambiguously.
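The angle-bracket version of “don’t confuse the computer” takes a few lines to demonstrate; a parser is just a very picky delimiter-checker. (The record and title element names here are made up, not any real schema.)

```python
import xml.etree.ElementTree as ET

well_formed = "<record><title>Scanglobe</title></record>"
broken = "<record><title>Scanglobe</record>"  # closing </title> missing

ET.fromstring(well_formed)  # parses without complaint

try:
    ET.fromstring(broken)
except ET.ParseError as err:
    # The validator's whole job: notice the missing delimiter for you.
    print("caught:", err)
```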

Even our XML-based metadata standards aren’t necessarily doing that, much less our MARC-based catalogs! MODS has composite fields and weird ISBD punctuation. I have to share this thing that broke my heart—I was digging into the history of MARC and ISBD internationally, and it turns out that the Germans were totally bent on killing ISBD punctuation out of MARC and relying on MARC delimiters only, which from the point of view of 2015 totally would have turned out to be the right decision, but English-speaking MARC went the ISBD direction instead, and just argh, 20/20 hindsight. One more mess we now have to clean up.

As a recap, here’s what we have to do: We have to get a handle on our free-text issues. When we’re saying the same thing, we need to say it the same way every time. We need to atomize our data, make it as granular as it can be. The delimiter thing, sometimes we have too many and sometimes we don’t have enough, and we need a happy medium. And when we can identify something as well as labeling it, we should, because identifiers make computers happy and useful.

Just to reiterate, we don’t have to do these things Because Linked Data. We have to do these things Because Databases, and Because Search Engines, and Because Faceted Browsing, and Because Internet, and Because Web. Basically, Because Not-Too-Bright Computers. Now, no secret, cards on the table: if we do these things, we’ll be one heck of a lot closer to having linked data. So it’s not like I’m ignoring that, I’m just saying linked data is not all this is about.

Well, that’s great. Plenty of work for catalogers, right? How do we actually do it, exactly? Quomodo faciendum est?

“By hand, one record at a time,” is not the answer here. It can’t be. I mean, yes, there will still be weird outliers that we end up fixing by hand—there always are, I could tell you stories—but we need to throw computers at fixing the easier problems and limit our handwork to those weird outliers. We have too much data to do it any other way. I’m not sure that doing it organization by organization is the way either. In my head, that means a lot of the same problems are getting solved redundantly in parallel. That costs too much and takes too long.

But the truth is, we collectively don’t yet have the know-how we need to collaborate on this. Most computer programmers don’t have the MARC knowledge, and many catalogers don’t know how to work with records en masse. So this problem? Solvitur discendo; it is solved by learning. Catalogers, developers, we all have some learning to do. I would say that, right? I’m a teacher. But I’m a teacher because I believe this.

Any managers in the room, supervisors? Give your people time and space to learn; do not make me yell at you about this! Your cataloging backlog or the new tech thing your developers are working on is far less important than your strategic preparation for what’s barrelling down the pipeline at you, okay? By all means hold your people accountable for actually learning as opposed to complaining about learning—again, I could tell you stories, but you can probably tell me stories too—but let them learn. Help them learn. Learn yourself, it won’t kill you and might make you stronger.

Here are some tools I think are well worth adding to your toolkit if they’re not there already, because they’re designed to fix stuff in lots of records at once rather than one-record-at-a-time. I’ve ordered them by the order in which I would recommend that a cataloger learn them. The last half you may not even need; I mention them because there are situations where they’re genuinely going to be useful.

  • MARCEdit
  • OpenRefine (try LODRefine)
  • Regular expressions (try regexone.com)
  • SQL (if you have to extract data from a relational database)
  • XSLT (if you have a lot of XML around)
  • Catmandu/Fix (don’t start here, Here There Be Yaks)
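Regular expressions in particular reward a little hands-on practice. Here’s a small Python warm-up in that spirit: pulling a bare ISBN out of the kind of messy strings that accumulate in 020 fields. The sample values are invented, and the pattern is deliberately loose—real ISBN validation involves check digits, which this doesn’t attempt:

```python
import re

# A regex warm-up: extract a bare ISBN from messy 020-style strings.
# Sample values are invented; the pattern is loose (no check-digit math).

ISBN = re.compile(r"(?:97[89][-\s]?)?(?:\d[-\s]?){9}[\dXx]")

samples = [
    "0226103897 (pbk. : alk. paper)",
    "ISBN 978-0-226-10389-5 (cloth)",
    "9780226103895",
]

for raw in samples:
    m = ISBN.search(raw)
    # Strip hyphens and spaces so every record says the ISBN the same way.
    isbn = re.sub(r"[-\s]", "", m.group(0)) if m else None
    print(raw, "->", isbn)
```

All three messy variants normalize to the same consistent form—which is exactly the “say the same thing the same way every time” principle, applied with about ten lines of code instead of ten thousand record-by-record edits.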

How should you learn, and what should you do with what you learn? In my head, those questions are intimately entwined. In some circles I’m known for the phrase “beating things with rocks until they work.” If you use mosaic rocks this probably won’t work, but you get the idea. I think plain old mucking around and breaking things and fixing them is the best way to learn new tools, myself—it’s certainly how I do it, and it’s the method I have my students use. Pick something to do with the tool, then do it, and if a few rocks get beaten on along the way, it’s all good.

What should you do exactly? Well, try cleaning up your catalog data! Go fix your 020s with MARCEdit, see if you can at least make those format notes more consistent. Export your data into something OpenRefine can read—it’s doable, I’ve talked to catalogers who did it—and see if you can cluster your 504s such that you can figure out in the majority of cases if there’s a bibliography, if there’s an index. Try fixing your dates, there are all kinds of fun and interesting problems you’ll run into and have to come up with some way to solve. You know your catalogs way better than I do; you know where the worst problems are. You also know where the most important problems to fix are, which I think is totally crucial knowledge here. So learn the tools by fixing the problems you already know are there.
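To give a flavor of what that 504 clustering looks like under the hood, here’s a rough Python sketch of the key-collision idea behind OpenRefine’s fingerprint clustering—normalize each note to a key, then group the variants that collide. The note strings are invented, and this is my simplification, not OpenRefine’s actual code:

```python
import re
from collections import defaultdict

# A rough sketch of key-collision clustering, in the spirit of OpenRefine's
# fingerprint method. Note strings are invented; real catalogs are messier.

def fingerprint(note: str) -> str:
    """Lowercase, strip punctuation, sort unique words: a crude cluster key."""
    words = re.findall(r"[a-z]+", note.lower())
    return " ".join(sorted(set(words)))

notes = [
    "Includes bibliographical references and index.",
    "Includes index and bibliographical references",
    "includes bibliographical references and index",
    "Includes bibliographical references (p. 213-220).",
]

clusters = defaultdict(list)
for n in notes:
    clusters[fingerprint(n)].append(n)

for key, members in clusters.items():
    print(len(members), "variants ->", key)
```

The first three notes, despite differing capitalization, punctuation, and word order, collapse into one cluster you can normalize in a single operation; the fourth (no index) correctly stays separate. That’s the leverage: fix a cluster, not a record.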

We’ve got a lot of work to do, and it’s got to be done, so let’s jump in and do it. Thanks very much, and I’m happy to take questions.


Quia Faciendum Est: Introduction

If there’s a harder speaking task than talking to catalogers about the imminent death of MARC (and that death is imminent), I don’t know what it is. It’s such a hard job that I found myself resorting to what for me is comfort art—Roman mosaics, which I love.

I can’t say I achieved miracles with this talk, given for the Potomac Technical Processing Librarians’ annual symposium; one attendee tried assiduously to convince me of the one true MARC way afterwards. I do think I managed to come across as non-threatening, though, which may well have been the best outcome possible.