Who’s Devouring Whom? Preventing Publishing Predation

So hi, I’m Dorothea Salo. I’m a librarian, and I teach in the Information School at the University of Wisconsin at Madison. Thank you so much for inviting me to share Open Access Week with you here at Loyola! It’s just about my favorite week of the year. I’m happy to come to Loyola specifically because of how long you’ve been paying attention to open access in general and Open Access Week in particular. I looked at your history of Open Access Week programs; you’ve had some amazing people here and I’m honored to become one of that number this year. Seriously, Loyola should pat itself on the back—no, I mean it, go ahead, pat yourself on the back—lots of places aren’t nearly this interested or this aware.

The first thing I want to do is validate some feelings that I suspect are being felt in this room. Lucky us, we’re sitting in a time of rapid and quite titanic change in how scholarly conversations happen and how research is published and how those publications are circulated and recorded for posterity. It is seriously a jungle out there! And since so much about an academic career is rooted in publication, all this change generates a lot of career anxiety. It’s like the world is changing out from under us, all the rules we’re familiar with are in flux, and what are we supposed to do, right? If you’re anxious, I want to tell you you’re not alone. Of course you’re not, how could you be? So I want to say, this stuff is not changing at you; it’s just changing. And I hope by the time I’m done today you start to understand some reasons that it’s changing, and even why you might want to help that change along.

I organized this talk around a question: who’s eating whom? If you think I chose this theme because so-called “predatory publishing” is back in the news, you’re absolutely right, congratulations! I totally intend to talk about that. But what I really want to do today is expand your sense of what predatory behavior in scholarly communication is, okay? Because a lot of past and present predation—and there really isn’t a better word for it, I like the word “predation” a lot here—tends to slide under a lot of people’s radar. You don’t see it happening, you don’t see its victims, you don’t think about its effects on your colleagues, on your students, on your discipline, on research generally, on the world generally. Lots of things just become “the way things are,” no matter how predatory they actually are if you look at them closely. I know we have some librarians in this room, so I just want to say, don’t get smug, librarians. We like to think this is our turf, we’re awesome, we get this right. But some of us are as clueless as the most newbie graduate student ever, and a lot of us are a lot lazier about open access than we ought to be, and we need, as a profession, to do better than that.

Now, the danger of framing things in this predator-prey way is, everybody in scholarly communication is feeling like prey right now! Right? We all feel like minnows being chased by sharks here! There isn’t anybody in this system who doesn’t feel like they’re lunch. So for the duration of this talk, I’m asking everybody in this room to entertain the notion that maybe, just maybe you too are a predator. Maybe some of the things you yourself do hurt others in the system. Maybe you enable predation—maybe you yourself don’t eat the cute minnow, but you lead the shark to it, or you chase it a lot so it gets tired out and a shark eats it. I don’t want to beat this metaphor into the ground—um, probably too late there—but I want us to recognize and own the things we ourselves do that cause harm to others in the system. I want us to believe that maybe it doesn’t have to be this way. I want us to affirm today that scholarly communication is our system and we can change it for the better—maybe we can go vegetarian, as it were. Borrowing a line from Disney and blessing fair use for my ability to do that: Fish are friends, not food!

In other words, no honey badgers here today, okay? For those not up on their internet memes, honey badgers are incredibly predatory, and they famously don’t care about anything or anybody. Honey badger don’t care! Come on, I’m from Madison, badgers were inevitable. But no, no honey badgers here; Loyola really does care. I took this straight off your website, the mission page: “a profound commitment to the poor”—including, I hope and believe, the information-poor—“and to issues of social responsibility and justice. Ignatius believed that it is imperative to act upon what is learned—to use our values in service to humanity.” Wow, I love this. I love it so much. It is beautiful and brilliant and very librarianly and I love it. I want to add this to it: anybody acting as a publishing predator has lost sight of social responsibility, justice, and service to humanity. I firmly believe this. And we’re all predators sometimes, right? I just said that. So we all have some work to do to reconnect our publication-related behaviors with this mission. Here’s a few examples of what I mean.

Here’s predatory behavior for you. Want to read this article? Fork over 30 bucks to get past the paywall, buster. This one actually hit me personally. I wanted to read this article because it’s dead on point for a course I teach, but our library doesn’t subscribe to this journal. I didn’t want to read this article so badly that I was going to pay 30 bucks for it, much less make my students do so, so I just didn’t read it. Paywalls are predatory behavior! My students and I fell prey to this particular one. Chances are, most people in this room have fallen prey to other paywalls.

But I want to suggest also that the authors of this article, Marjit, Sharma, Sarkar, and Krishnamurthy, got turned into lunch by this paywall. They didn’t write this article for me not to read it! They don’t get anything out of keeping it away from me and my students! The authors are certainly not seeing any of that thirty bucks! They want to be paid in intangibles like reuse and citation and credit, but because I can’t actually read this, I can’t reuse it, and I won’t have any reason to credit or cite the authors, right? So the authors become lunch too. Every time the normal cycles of teaching and research and learning get eaten up like this, it makes me sad and angry. Maybe you too? It’s not okay.

On Twitter there’s this little acronym, “tl;dr,” that stands for “too long, didn’t read.” I’ve seen a couple of knockoffs of that acronym calling out paywalls. This one, “bp;dr,” stands for “behind paywall; didn’t read.” I’ve also seen “pw;dr” for just “paywall, didn’t read.” Every time somebody doesn’t read something because of a paywall, there’s predatory behavior somewhere, somebody’s becoming somebody else’s information lunch. And speaking of Twitter, the #icanhazpdf hashtag is also a thing, and in my book it’s just more evidence of how broken everything is. However you want to abbreviate or hashtag it, this is not okay. Nobody wins when people who care about our research get turned away by a paywall! And this happens all the time. It happens to us here in academe. It happens to our students. It happens to K-12 educators and their students. It happens to public librarians trying to guide people to good information. It happens to practitioners out in the world trying to apply research to real-world problems, it happens to policymakers, citizen scientists and citizen humanists, grant agencies and grant reviewers, journalists… all kinds of people with all kinds of legitimate interest in reading research!

And it shouldn’t happen. This is not service to humanity, this is not care for the poor, this is not social responsibility and it’s not justice.

An objection I sometimes hear from faculty to this line of thinking is that hey, they’ve got access to whatever they need, so obviously nobody’s being predatory, the system works, it’s all good. That’s just the most predatory thing I’ve ever heard! It’s basically “forget you, I got mine!” and it’s not an attitude I expect here at Loyola. And a thing I’ll ask you to do, because you are more enlightened than this, is to please challenge this when you hear it from your colleagues at other institutions. It’s just so predatory and so wrong.

Now, the way we supposedly get past the paywalls is through libraries, right? Libraries pay for access, publishers get paid for access, everybody gets access, everybody loves the publishers and librarians who make it possible, everybody wins.

Really? Everybody wins? Does everybody in this room feel like they’re a winner under this system? Seriously, I’m a librarian and I don’t feel like I’m winning! Look, let me just lay it out there, libraries are not the fix here, and partly that’s our own fault. Even those who can get into an academic library in the first place—and let’s not forget that’s not everybody—even a ginormous library system like the one where I work doesn’t have access to everything everybody needs. Am I right? So “library as access solution” is just not working, if it ever did, and it’s not going to start working again any time soon, sorry. It’s not that we librarians aren’t trying; we totally are. It’s that the whole game is rigged against us, and we need help and support to fix it. In other words, libraries are lunch. We have been lunch for the last 30 or 40 years, we’re still lunch, and if we’re going to stop being lunch without becoming predators, we have to change too.

Faculty, we need your understanding and your help to do that. When you blaze into your librarian’s office yelling about the latest journal cancellations, you’re being predatory. Instead, ask for the price, the usage numbers, and our cancellation rationale. We’ll show you. Librarians, when you just cave in to the faculty member who blazed into your office yelling about the latest cancellations, you are being predatory. Who are you eating? Other campus denizens who need materials that your library can’t buy because you’re catering to the people who yell the loudest. It’s not fair and it’s not okay, and all it does in the long run is get more people yelling at us because we’ve taught them that’s how to get what they want.

Here’s what we’re up against. Library budgets are flat as a pancake most places, and even where they’re rising, they’re not rising as fast as regular inflation, never mind actual journal-price inflation, which runs leaps and bounds higher than general inflation. So libraries have been steadily losing purchasing power, and all the money in the world wouldn’t actually catch us up. At the same time, the scholarly literature is only getting bigger! And explaining what happened here would take me more time than I have today, so I won’t—trust me, I could go on for hours; I teach a summer course where I actually do!

I will say that we have all been predators. We’ve all played the honey badger and decided not to care. We have all let this predatory system persist. Librarians signed on to Big Deals that were only ever going to blow up in everybody’s face. Faculty steadfastly refused to pay attention to what was going on except to blame librarians. Administrators wrongly thought that just giving libraries more money would solve the paywall problem. Publishers and aggregators have been playing some seriously ugly and cynical games with non-disclosure agreements and pricing. It’s just fail all around. And here’s the kicker—no, really, this is unbelievable but it’s true, people have done the math here: the money already in the system is plenty enough to do the actual work of publishing the literature. That money is just disappearing into the wrong pockets, not to mention being misspent to prop up paywalls.

Part of the problem is that reward schemes in journal-based academe don’t reward some things that they should. They don’t reward openness, and they don’t reward evidence that people actually read or used or cited your specific stuff. They reward publishing in journals with high Journal Impact Factors, and there are a million reasons Journal Impact Factor is total predation, but one of them is that high impact-factor journals tend to be paywalled. If you reward impact factor, you’re rewarding predatory paywalls. That’s changing, thank goodness, but there’s another reason to walk away from impact factor, too. Who falls prey to impact factor? Who gets eaten for lunch? Younger scholars pre-tenure, that’s who. The stupid pointless inaccurate impact-factor chase destroys promising research careers. That’s evil. And it’s entirely within academe’s control; you can’t blame this one on librarians or publishers.

Oh, except when librarians act exactly like academe—and yes, I happen to know there’s at least one Chicagoland academic library that relies on impact-factor based “prestige” for judging their librarians for tenure. You know who you are. Shame on you. Walk out of here determined to set a better example, please. Look, this goes for everybody, if your tenure and promotion guidelines even mention Journal Impact Factor, if your department uses it to assess your people and their work, I want you to walk out of here today vowing to change that, because it’s predatory in and of itself, and it feeds into other kinds of predation. If you rely on Journal Impact Factor, you are the predator.

It’s these predatory priorities in how we assess scholars that lead pretty directly to a lot of gross scammy behaviors in the journal market. Let me be totally clear here: so-called journals trying to scam author-side fees are totally a thing, but they are not common and not hard to avoid if you’re paying attention:

  • Did they spam your email? Walk away.
  • Does the home page brag about Ulrich’s or Google Scholar? Walk away.
  • Catch them in a lie? Walk away.
  • Not listed at doaj.org? Walk away.
  • Walk faster if it’s on DOAJ’s list of journals that lie about being DOAJ members.

They’re also not the only predatory scam artists out there. We have drug companies buying up journals to fool clinicians into prescribing particular drugs. We have coercive citation, “cite our journals or we won’t publish you,” which is totally predatory and wrong. We have vanity one-author toll-access journals, I kid you not, and it’s amazing how long they last before anybody does anything about them. And we have giant multinational corporations making 30% to 40% profit margins—profit! that’s pure profit!—telling universities and libraries with limited and still-decreasing budgets “hey, how about another giant price increase this year?” Now that’s predatory.

That article that came out, something like 80 million dollars supposedly heading to scam open-access journals? Trust me, I’d love to see the scammers gone too, but 80 million is a drop in the ocean, folks. It’s a rounding error against the multinational journal publishers’ profit margins. Be outraged! This is outrageous! But train your outrage where it’ll do the most good, okay?

Here’s what a system that doesn’t chow down on potential readers looks like: you find a journal article on the web that looks interesting, you download it or just read it right on the page, end of story. It should be that simple, right? Because who needs the hassle of anything else.

I encourage everyone to take a look at the so-called “altmetrics” movement, which is expanding our understanding of scholarly impact well beyond Journal Impact Factor, in ways that are respectful of the power of open access. Here’s an example, from one of my own articles published a couple of years ago in an open-access journal, and just as a disclaimer I’m now on this journal’s editorial board. You can’t read the text I’ve circled here, it’s too small, so I’ll just tell you that this article went wild on Twitter, over 200 tweets. Who saw them? Well, just adding up the number of followers on the tweeting accounts, you get over 312,000 followers. Now, of course the real number isn’t anywhere near that high because of Twitter follower overlap, and of course not everybody who sees a tweet is going to click through and read the article. But stick with me for a thought experiment here: if one-tenth of one percent of that upper bound, those 312,000 followers, found the article this way and actually read it, that’s over 300 readers! Let’s get wild here, if it’s one whole entire percent? That’s over 3000 readers! Another article I wrote back in 2008 I put in the institutional repository I was running at the time, so I had access to its download stats. When I left to join the iSchool four years ago, it had something like 14,000 pageviews. Think this kind of math could make a difference to a tenure and promotion committee? I think it might. I surely think it ought to.

The numbers are kind of a sideline, though. What makes me happy about this is that I know for certain that anybody who saw a tweet about my article and was curious enough to click through could immediately download and read it. No paywalls, no nonsense, nothing in the way, exactly as it should be. Now me, I’m lucky enough not to be tenure-track, so I have the luxury of being in the publishing game purely to make a difference. Open access helps me do that. Paywalls would just eat my readership for lunch.

Now, not everybody is going to care about Twitter specifically. That’s okay! Be critical of what’s being measured and what it means. That’s awesome and you should do that; everybody else sure is. Altmetrics is way more than Twitter, fortunately. Just please, make sure you’re turning the same critical eye on the very-possibly-predatory metrics you’re already using, like impact factor. Don’t let that stuff escape scrutiny.

So, on another topic, does anyone recognize this young man? He’s Jack Andraka, who while he was in high school designed a new inexpensive cancer test based almost entirely on open-access research literature. Now, here’s the thing. Do you know who the next Jack Andraka is, in your discipline? Because it’s an educator’s dream, finding and influencing a brilliant new talent, right? Do you know where the next Jack Andraka goes to school? Do you know what journals the next Jack Andraka’s school library has access to? Well, yes, you probably do, it’s coterminous with “open access journals” because be real, high school libraries can’t afford journal paywalls. Do you even know what country the next Jack Andraka will come from? Odds are good it won’t even be the US! And there are lots of countries with, I must and do believe, lots of Jack Andrakas in them, whose access to the journal literature is unbelievably worse than it is here.

I don’t know who or where the next Jack Andraka is either. I also don’t know how many potential Jack Andrakas can’t become Jack Andraka because their research, their inquiry, their curiosity, their drive, gets eaten alive by predatory paywalls. So in a very real way, we don’t get more Jack Andraka-style wins until there’s a lot more open access. I think we as a society want those wins, I really do.

Anyone recognize this young man? His name was Aaron Swartz, and he tragically took his own life after becoming the target of federal prosecution over an attempted mass download of the JSTOR journal database. Now, I don’t want to say that lack of open access is solely or uniquely at fault for how the system chewed up and spat out Aaron Swartz. That would be ludicrously overstating the case. The social and technical and legal structures we’ve built up around scholarly paywalls did contribute, however, and all of us in academe—faculty, publishers, librarians, content vendors—need to recognize and own that part of the fail. I would like this never to happen again. With open access, I sure hope it wouldn’t need to.

Because this kind of predation is still happening, now, today—does anyone recognize this man? His name is Diego Gómez, he’s a dissertator, and he is facing a ruinous lawsuit in Colombia because he scanned and posted a dissertation online that was so important to his work he thought everyone should be able to read it. Now, Colombian copyright law is out of control, it’s even worse than ours, which is bad enough. But even beyond that, Diego Gómez would not even be in trouble in a world of open access! He wouldn’t have to post a PDF of a thesis that wasn’t his, he could just link to it! And again, isn’t it supposed to be that easy to acknowledge and credit our good influences?

In passing, it’s interesting to me that this is happening to Gómez over a dissertation specifically, because it’s exactly with dissertations that a lot of universities and a lot of young scholars are experimenting with open access. And the world hasn’t come to an end, the sky has not fallen, nobody’s been eaten by sharks that I’m aware. If anything, what I usually hear is great stories about young scholars making connections with established scholars over an open dissertation, open dissertations leading to unexpected-but-welcome publishing deals, undergrads and new grad students understanding the research process better because of open dissertations—I couldn’t ask for a better illustration of how amazingly useful and helpful open is.

So here we are. We want more open access because we want a system that isn’t turning willing readers and willing writers into lunch. So… why aren’t we there yet?

It boils down to what economists call—say it with me if you know it—a collective action problem. Basically, there are lots of stakeholders who want to see a less lunchy system. Students. Teachers. Researchers. Librarians. Policymakers and lawmakers. Research funders. Even some publishers, though not all. But no stakeholder can do it all alone, just too many moving parts and way too much inertia in this system for that! So it’s really easy for everybody to just sit back and say, hey, vegetarianism would be great, no more predators would be great, but there’s nothing I can do myself that will make a difference, so let somebody else do something. I won’t say nobody does anything, because we wouldn’t even be here today if a lot of people weren’t doing a lot of things, but a lot of people do nothing, so nobody can actually do enough.

So now what? How do we stop the predation? How do we push past the collective action problem? Well, that’s what Open Access Week is about, really—and when Loyola’s mission says that Ignatius believed in action, taking action to open access to more research is what it’s about.

What it amounts to is, there are a million excuses in the naked academy for not pursuing open access, and I have heard them all. You cannot surprise me; I’ve been doing this for a decade.

  • That’ll never work in my discipline!
  • But tenure! But promotion!
  • Oh, just give the library more money so they’ll shut up.
  • Who’s paying for this anyway?
  • I don’t know what a repository is, but I’m pretty sure we don’t have one.
  • But that’s work, and I’m too busy.
  • Well, my work isn’t really ready yet…
  • OMG copyright!
  • I heard it’s all a scam.

Maybe that’s how you feel. Maybe you are totally willing to make excuses to me when I can’t get to an article I want to read and maybe use in my teaching. Maybe you’re okay making excuses to Jack Andraka, or Diego Gómez! But I’m not, and I don’t think anybody in the academy should be. That is not what we are about here, especially this week. We’re about action.

There’s lots of ways to take action! You can take action as an author, as an editor or reviewer, as a good departmental citizen or a good university citizen, or as a good citizen generally, out there in the world of politics and advocacy. I’m going to make a couple-three specific suggestions, but you don’t have to follow them; they’re just ideas! Got a better way? Go for it! No matter what anybody tells you—and yes, I know there’s a lot of zealots out there and I wish there weren’t too—as long as you’re legal, there’s no wrong way to be open.

No matter where you are or what you do at Loyola, there’s something you can do. At base, there’s this: you can learn and you can teach. Because the collective action problem is real. Nobody can flip the switch to open alone; if it was that easy we’d have done it already. So make sure nobody’s alone! Let’s do this together.

Back in Madison over the summer this year, some folks started up a monthly open meetup—low-stakes, low-commitment, just interested people talking and teaching and learning from one another. It’s not a huge group. Yet. I mean, it hasn’t even been six months yet, these things take time! But it’s the right people gathering for the right reasons and I love that; I’m super-happy about this. (I’m allowed to say that. I didn’t organize this; I just go to the meetings and help out with the Twitter account.)

Could you do this here? Of course you could, and you’re so lucky to be in Chicago, because it’s dead easy to build a critical mass here! Just off the top of my head there’s a half-dozen people right here in Chicago who know their way around this stuff and are great facilitators. So go to it, monthly meetup, hopefully over deep-dish pizza because this is Chicago, right? (In Wisconsin it’s about cheese.) You can do this!

You don’t have to stop at, or even start from, open access to the journal literature! Maybe you’re about open licensing, making sure the scary wing of the copyright-holder party doesn’t take over the entire cultural and scholarly universe, hello there, Trans-Pacific Partnership. If that’s you, I am with you! We need this kind of legal infrastructure to make open access work.

Or maybe you publish books rather than articles. I won’t lie, open access to books is a long way behind journal articles because the economic structures are different and so the solutions have to be too to some extent, but it’s starting to happen; we’re starting to figure out how to pay for it—I’m serious, just in the last month two or three major reports have come out about new models for sustaining monographs—and you can be part of that! I hope you’ll consider it.

Oh, and while I’m on the topic of the humanities, I hear two common excuses for why open-access humanities journals haven’t been happening. One of them is “our journals aren’t expensive! why should we have to change?” And here’s the thing about that: you’re right. You’re totally right; your journal prices are rounding errors. But that doesn’t get you off the predatory hook, sorry, because the big multinationals publishing science journals and making obscene profits are hiding behind you, claiming they’re just the same as you. They’re also eating you for lunch, of course, because the money libraries are paying to the big multinationals isn’t paying for books, but seriously, retire this excuse, okay? You’re just helping to rationalize a predatory system that’s eating you for lunch. Come on, I know you’re smarter than that.

The other excuse is “we don’t have big grants, how will we pay author-side fees?” And that’s based on the weird idea that author-side fees are the only way to pay for open-access journals. That’s not true. That’s never been true. I honestly don’t know why anybody believes this! The Open Library of the Humanities just launched with seven journals (soon, I am told, to expand), and it’s tremendously exciting. If you edit a humanities journal, if you’re thinking about starting one, give OLH a look. It’s open access designed for the humanities, without author-side fees.

Librarians in this room: I am bitterly ashamed to say that I checked the OLH supporter list and I didn’t see one single Chicago-area institution on it. Come on, y’all, if we don’t want to be predatory free-riders, if we’re serious about open access in general and open access to the humanities in particular, this is the best way I know to put our money where our mouths are. When you leave this room—or even before, I don’t mind if you use your mobile tech while I’m talking—make an appointment to talk with whoever does collection development in the humanities in your library. Let’s get some Chicago dollars floating toward OLH and Knowledge Unlatched and projects like that.

Faculty, you can do the same thing! Don’t yell—yelling is predatory and not cool—but suggest to your librarians that these are places you want to see library dollars going. The amazing Leslie Chan of Toronto—great human being, read up on him if you don’t know him—Leslie once suggested what he calls the “one percent solution.” And you can probably guess what that means just from the name, right? Take one measly percent of the library’s collections budget and toss it toward open access; it’ll make a greater difference than you can imagine. So take that home and think it over, too. One percent. Just one percent.

If you’re into open textbooks, open educational resources, we are totally friends, you and I! I am all in favor of keeping Pearson from devouring the universe, because frankly Pearson scares me. This here is the Open Textbook Library from Minnesota, and they’re looking for textbook reviewers from all over, so pitch in! And if an open textbook can work for you, please adopt one; this and OpenStax are great places to start looking.

Or maybe your thing is open data. We are still friends! Open data is my thing too!

Or maybe, librarians, archivists, and humanists, you’re interested in opening up the amazing wealth of unique material in local collections through digitization and open metadata. Great! I am with you! I’m a Community Representative for the Digital Public Library of America, which is trying to do just what I just said, so ask me anything about it after we’re done here. I think this opportunity is amazing, and I’m so pleased that so many libraries and archives and museums and personal collectors are getting involved! So there’s lots of ways to open up more wins. Please learn about them, this week and beyond. I think you’ll find a niche you’ll like.

For now, though, circling back around to the scholarly-journal literature…

If you create academic things? I’m asking you to make one of those things open this week. Just one. If you can do that with a journal article you’ve written, great! Legally, please—if you need help figuring out where you are with copyright, the librarians here or wherever you are can help you. But look, it doesn’t have to be a journal article, it can be anything academic you’ve made and hold rights to that would help other people. Normally I’d tell you I will pledge this alongside you, but I have a little bit of a problem there, namely that everything I’ve ever published that I hold rights to is already open, one way or another, and a whole lot of other stuff I’ve done besides.

Come join me in making work open. I’ve done some regrettable things during my career, but there’s one thing I have never regretted for an instant, and that’s making my work open. It has opened so many doors, created so many opportunities for me, introduced me to so many amazing people, some of whom have said to me, “I only got to read your article because it was open access, and it really helped me, so thank you.” What’s to regret about that, right? So join me. Jump in, the water’s fine and there are no sharks in it! Let open work for you too.

Now, you need to put your work someplace where it’ll stay there, okay? Not just any old web server—anything can happen to them and usually does—but someplace that somebody beyond you has a commitment to and will take serious care of. Time is the great devourer, especially of all things digital, and we’ll all retire someday, right?

Here is your library’s commitment to making your work open for the long haul, the long now, tomorrow and tomorrow and on into the future. This is Loyola University Chicago’s eCommons, and it’s run by your librarians here at Loyola. So I’m asking you to join your librarians in their commitment to making Loyola a more open place. To that end, here’s that goal for you again: by the end of this week, I want everybody in this room, every single one of you, to add your one thing, whatever it is, to Loyola eCommons. Can I get the Loyola librarians in this room to raise their hands? Okay, thanks. These are the folks you go to for help putting in your one thing.

One more thing. I’d like each of you to sign up one other person at Loyola this week. Just one. And convince that person to put one piece of work in. Just one! Because each one teaching one, that is how we get past the collective action problem.

As a department, I need you to fix your tenure and promotion practices, especially if they depend on Journal Impact Factor, which is utterly predatory. And humanities folks, if you’re still so fixated on print books that you can’t even imagine worthwhile scholarship in any other form, I don’t even know what to say to you except stop that, it’s 2015, come on! The Modern Language Association has guidelines for you; it’s time to implement them.

As an institution, it’s Loyola’s job to dream big, set the big goals. Here’s the one I’d suggest, just for starters: “Loyola wants no one interested in Loyola research to be turned away by a paywall.” Then figure out how to get there! I know it’s tempting to just be reactive, or even to resent all this because you don’t want it imposed on you. I’m asking you today to be bigger people and a better university than that. Own open access. Own it yourselves, as an institution. Adopt this goal, or one like it, and own it, and make it happen.

That will take a lot of talking and a lot of work—that’s just life in the academy—but others have walked this road before you, and plenty of them are happy to walk alongside you, because we all win when we’re not being eaten alive by paywalls. I believe Loyola can do this. This is not something I say about every institution—I’ve been doing this for ten years; it’s been hard and I’m incredibly cynical, sorry. But Loyola has the right mission and the right people and the right preparation. I look forward to seeing what you accomplish!

Thanks again for inviting me, and I’m happy to answer questions!

Who’s Devouring Whom? Introduction

Open access to the scholarly literature stopped being my job in 2011. I have yet to escape Open Access Week, however. A couple of years ago I had to tell some of our local students, gently, that no, I wouldn’t do yet another Open Access Week talk at UW-Madison; it made no sense for them or for me, since everyone there who ever wanted to hear what I have to say about open access already has.

I am still a useful second-string speaker at institutions that (for whatever reason) don’t spring for one of the top names, it seems. In contrast to my usual modus operandi, I don’t write a new talk every year for Open Access Week; instead, I have a mostly-stock talk that I adapt to the institutional context where I’m going.

Loyola University librarians told me that so-called “predatory publishing” was a concern among Loyola faculty and administrators. They also hoped for fewer electronic thesis embargoes and more (and more willing) participation in their institutional repository. I told them I’d try to speak to those issues, and I did try.

Quia Faciendum Est

Thanks for that gracious introduction, I very much appreciate it! I’m Dorothea Salo, and I am not nor have I ever been a cataloger, though I may be one of the last generation of library-school students required to take a full course in cataloging, I don’t know. I just know we at the iSchool are doing a major curriculum revision, and current odds are that we’re keeping the core organization-of-information course, just changing it to de-emphasize MARC cataloging in order to include more material relevant to non-MARC environments. Interesting times, not that they’re ever not.

Curriculum revisions are never fun. They eat everybody’s time for months on end, they always cause bureaucratic hassles out the wazoo, and for a while you have to deal with two different incompatible curricula and remember which rules apply to which student you’re advising… it’s a mess. Why do we do this to ourselves? Why do we bother, if it’s such an awful hassle?

My answer is quia faciendum est, “because it must be done.” (Just in Latin it sounds cooler.) We at the iSchool can’t just sit back and do what we’ve always done because we’ve always done it that way, not when the world our graduates will need to fit into is way different from what it was when we built the old curriculum. It doesn’t mean we did a bad job on the old curriculum; I don’t think we did! It’s just that the world has changed out from under it. So we have to change the curriculum. We don’t have to enjoy it. We just have to do it, because it must be done.

Not coincidentally, that’s pretty much how I feel about the move away from MARC. It feels to me like a lot of the library profession has spent half a decade now, at least, on the question cur faciendum est? or “why must it be done?” And even though this question sounds really cool in Latin, I have completely run out of patience with it. Don’t even tell me nobody’s asking this still. I straight-up heard it just last May at a conference; it’s totally still out there floating in the water.

You like water? Here’s some water in a nice harbor, really pretty, love the lighthouse, but the thing that isn’t in this picture of a pretty harbor is a ship. Because the “why can’t we still use MARC?” ship has sailed, people! It has sailed. I’m not even having the why-do-we-have-to-change discussion today, I honestly don’t see the point, that ship has sailed.

I’m next year’s program planner for the IT Division of Special Libraries Association, and I was talking to SLA’s technical-services planner Betty Landesmann—some of you probably know her—I was talking to her about a linked-data session, and she rolled her eyes at me and said “can we not do another intro to linked data and why it’s better than MARC please? I’ve seen a ton of those and they don’t help.” Okay, then. If I’ve got catalogers yelling at me not to do this, I won’t do it! Because really, the answer is exactly the same as it is for our curriculum revisions at the iSchool. Quia faciendum est. Because it must be done.

I do want to mention, though, because I come at technical services from an XML-and-metadata background rather than a MARC background, that it isn’t just MARC cataloging the bell is tolling for here. There’s zero chance XML-based metadata practice will stay the way it is today; I already see it changing. I’m not even sure XML will stay alive as a pure metadata format, as opposed to uses like TEI for the digital humanities and EAD for archives, where you’re dealing with narrative-type documents intended mostly for human beings. I’m okay with XML’s decline as a metadata serialization. I never liked my nice elegant document standard getting worked over by the data engineers anyway—do not even talk to me about XML Schema; it’s just horrific. Maybe now I can have XML back. For documents. As it should be.

No, I’m much more interested in this question: quid faciendum est; what actually is it that must be done? What do we have to do to our catalog data and metadata so it works in this world where so much has changed about how we find information? I like this question because it’s pragmatic, I like it because it’s intriguingly complicated, I like it because it’s nerdy in all the best ways, I like it because I am an inveterate fiddler-with-things and there’s just great huge masses of MARC and XML right there to be fiddled with… and it’s another of those questions we have to work on or we just stay stuck, right?

I don’t think it’s enough to just say “well, we have to migrate our data from MARC and MODS and METS and the various Cores—Dublin Core, Darwin Core, VRA Core, PBCore and so on—we have to migrate all that to linked data.” That’s skipping all the steps! That’s like saying “we have to pick up some rocks and turn them into a giant mosaic.” Whoa, wait, not enough information! What’s our mosaic design? Where will the mosaic be built? Where do we find the right-colored rocks, and how many rocks of each color do we need, and how do we cut them down if they’re too big or not the right shape? How do we glue the rocks down? What if somebody makes a mistake? What if there’s an earthquake?

Process. That’s what we need here. Some process, right?

The other reason I don’t think it’s enough to just say “well, we have to migrate our data from what we have to linked data” is that it assumes without proof that linked data is the ultimate destination for it. It’s the horse to bet on, I’m not saying it isn’t, but I just think linked data gets used as a stalking horse sometimes, a scapegoat. It’s linked data’s fault we can’t use MARC, it’s linked data’s fault all these changes are happening, if it weren’t for linked data supposedly being the new hotness we could stay the way we are and everything would be fine.

I don’t actually think that’s true. If linked data didn’t exist—and let me tell you, I have lots of days I’d love to wipe RDF off the face of the earth—if there were no linked data, we’d still have to make changes in how we collect and organize our catalog data and our metadata. We have to make those changes for the same reason we’re changing the LIS curriculum at the iSchool: the world has just plain changed out from under the old ways. That didn’t happen when the Library of Congress or the British Library announced their linked-data plans. It happened long before that. It happened when paper cards gave way to the Web as the main way patrons interact with library catalogs. It’s not that we didn’t notice; of course we did. It’s just taken us a while to figure out what we need to do about it. I don’t know how to say “oops” in Latin, but oops, I do think we maybe waited longer than we should have. Water under the bridge now.

That leads me right back to quid faciendum est? What do we have to do, now that the work we do has to play nicely with computers? And not just computers—MARC was designed for computers—but networked computers, computers that can talk to one another. The network really does change the game.

When I teach our core organization-of-information course, I start my students with this thought: ordinatra stulta sunt. Computers are not all that bright, I tell my students; you’re way smarter than a computer. I do this for a lot of reasons. One reason is knocking computers off pedestals—not actually literally knocking computers off pedestals, though hey, that would be kind of awesome—but you know what I mean, right? A lot of my students come into the iSchool thinking that computers are like unto small gods, magical and capricious and liable to mess you up, impossible to understand much less work with. I’m saying, I have to get them to not think that, because the more they understand about how computers do work, the better off they are, and the better off we all are.

But the main reason that the notion that computers aren’t real bright is relevant to this talk today is that from our point of view as literate human beings, computers are not-too-bright in some very specific and fairly easy-to-understand ways, and those ways tell us pretty clearly what our catalog data and metadata have to look like if we want computers to work effectively with it. Really this is no different from how the shape and size of catalog cards and the standard size of typewritten lettering shaped how the MARC record had to look. The technology you have available—and the card catalog is totally a technology, don’t let anybody tell you it isn’t—the technologies available to you shape how it makes the most sense to do things, because different technologies are good and bad at different things and need different things to function best. That’s basic design theory; read Donald Norman’s Design of Everyday Things and its sequels if you haven’t already, it’s brilliant stuff.

So the first thing to remember about computers is that text, the ordinary stuff we write for other people to read—the text we literate human beings read and comprehend so fast and easily that we hardly have to think about it—it’s all Greek to a computer. (You knew that cliché was showing up at some point, right?) Computers can’t read; they are functionally illiterate. If anybody in this room has a kindergarten-aged child at home, that kindergartener most likely reads and comprehends text immensely better than a computer can.

So, in my head, one step toward coping with illiterate computers is dealing with our addiction to textual notes. By way of example, I took some MARC 504s straight from the Library of Congress’s MARC documentation—thanks for that, by the way, if anybody here is responsible for it, it’s super-helpful in my classroom:

  • 504 ##$aBibliography: p. 238-239.
  • 504 ##$aSources: p. 125-152.
  • 504 ##$aIncludes bibliographical references.
  • 504 ##$aIncludes bibliographies and index.
  • 504 ##$aChronological list of the author’s works: p. 469-475.
  • 504 ##$a”Selected bibliography”: v. 1, p. 351-358, v. 2, p. 234-236.
  • 504 ##$a”Spisok izdannikh knig”: p. 277.

So, for example, suppose a patron question we would like our catalogs to answer is “hey, I’m new to this topic, can I get a recent book with a good bibliography please?” Because these notes are free text, our catalogs can’t answer that question. Getting to an answer means filtering a list of books by whether they have a bibliography or not, and to do that with MARC notes, a computer has to understand that “bibliography” and “bibliographies” and “bibliographical” and “literature cited” and “sources” and maybe whatever that romanized Russian means—I didn’t look it up—the computer has to figure out that they all mean yes, there’s some kind of bibliography! And no, a computer can’t just look for the existence of a 504 and assume there’s a bibliography, because some 504s don’t say anything about bibliographies, just indexes.

Look. The computer is not bright enough to figure this out. It can’t read, much less read all the languages we transcribe stuff in, much less comprehend what it reads. That makes a lot of the stuff in our MARC records a lot less useful to patrons than it could be.

Computers do understand yes or no; that they’re real good at:

  • ☑ Bibliography
  • ☑ Index

Checkboxes are candy to computers. So for any conceivable criterion we want our patrons to be able to filter their catalog results on, we pretty much have to quit recording it in text and make it a checkbox—or radio buttons if there’s more than two options; that works too. If you get the sense from this that I like MARC fixed fields, you’re right! (Though honestly, that festschrift thing? That is just weird.) If a lot more of MARC had been expressed in fixed fields instead of free text, we’d be a lot better off right now.

Now, let’s say for a moment that the titanic arguments we in the profession will have to have about when a book can be said to have a bibliography, and what counts as an index… let’s say those are over, and we’ve drawn the best line we can. We still have to deal with this giant horrible mass of free-text notes in our existing catalogs that computers aren’t bright enough to understand. Faciendum est, people, it’s got to be done. I don’t want to get down in the weeds on this—actually I would totally love to because this is exactly the kind of problem I salivate over, but I’m a giant nerd and not everyone is like me in that. So I just want to say, this class of problem can be solved for the great mass of our records without hand-editing, and of course it must be, because there ain’t enough staff time in the universe to check all those checkboxes. And it’s a thing that must be done for every characteristic in our notes that we want users to be able to filter or search on.
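To make that concrete, here’s a minimal sketch of the kind of batch pass I mean: a computer scans free-text 504 notes for cataloger-vetted keywords and sets a yes/no bibliography flag. The keyword list here is purely illustrative—real records would need a much longer, cataloger-reviewed set of patterns plus a bucket for notes a human has to look at.

```python
import re

# Illustrative (not exhaustive!) patterns a cataloger might vet as meaning
# "this 504 note indicates some kind of bibliography."
BIBLIOGRAPHY_HINTS = re.compile(
    r"bibliograph|literature cited|sources|works cited",
    re.IGNORECASE,
)

def has_bibliography(note_504: str) -> bool:
    """Guess whether a free-text 504 note describes a bibliography."""
    return bool(BIBLIOGRAPHY_HINTS.search(note_504))

notes = [
    "Bibliography: p. 238-239.",
    "Sources: p. 125-152.",
    "Includes bibliographical references.",
    "Includes index.",  # index only -- no bibliography mentioned
]
flags = [has_bibliography(n) for n in notes]
```

The point isn’t that this tiny ruleset is right; it’s that once the rules exist, the computer applies them to millions of records without complaint, and catalogers only touch the leftovers.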

Another serious and ugly free-text problem we have in our records has to do with places where our content standards don’t force us to be consistent about how we record certain information. Quoth the mighty AACR2r, “You know, whatevs. We’re cool. Just type something.” On catalog cards this inconsistency didn’t matter, because the information was only ever going to be skimmed by a human being, who doesn’t need a whole lot of consistency. We’re literate humans; we’re smart about figuring this stuff out. Computers are astoundingly literal-minded, however. You can take some text, add a space to it, and to the computer it’s suddenly something totally different. Just one space, that to a human isn’t even visible!

Once again, this isn’t just a MARC thing. Consistency in Dublin Core metadata? Don’t make me laugh. Actually, I’m more likely to cry. It’s bad out there, people. I mean, I once got an entire published article out of one poor soul in the institutional repository I was running at the time—he had stuff under eight slightly-different spellings of his name. Just terrible, and yes, I fixed it as soon as I had the screenshot I needed for the article—of course I did, I wouldn’t just leave it there in that state—but it goes to show.

I want to call out two Dublin Core things specifically, noting that you’ll find these problems many more places than just Dublin Core. Dates, oh my goodness, dates. Dates are really important to information-seekers, so it’s really important that we record them consistently such that a computer can reasonably intelligently filter based on them. We are not there yet; we’re not even close to there. Dublin Core, MARC, it doesn’t matter, people who try to make computers work are tearing out their hair about the ways we do dates, and we have to fix it. Faciendum est, we have to fix it.
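Here’s a hedged sketch of what “fixing dates” looks like in practice: a handful of normalization rules applied across a batch, with anything unrecognized flagged for human review. The rules shown are a small illustrative sample; real legacy data needs many more, and the specific patterns here are my invention, not any standard’s.

```python
import re

# A few illustrative normalization rules for date strings as they actually
# show up in records. Real cleanup needs many more rules, plus an
# "unparseable" bucket that a human reviews.
RULES = [
    # already ISO 8601: pass through unchanged
    (re.compile(r"^(\d{4})-(\d{2})-(\d{2})$"), lambda m: m.group(0)),
    # US-style m/d/yyyy: rearrange into ISO order
    (re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$"),
     lambda m: f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}"),
    # bracketed/uncertain years like "[2006?]" or "c1998": keep the year
    (re.compile(r"^\[?c?(\d{4})\??\]?$"), lambda m: m.group(1)),
]

def normalize_date(raw):
    raw = raw.strip()
    for pattern, rebuild in RULES:
        m = pattern.match(raw)
        if m:
            return rebuild(m)
    return None  # unrecognized: flag for human review

examples = ["2015-10-19", "10/19/2015", "[2006?]", "c1998", "mid-1800s"]
normalized = [normalize_date(d) for d in examples]
```

Note that the uncertainty markers get thrown away in this naive version; a fuller pass would keep them, which is exactly what a standard like Extended Date/Time Format is for.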

The second is rights statements, which come more from the digital-collections side than the catalog side. They matter because we have to be clear about what our users are allowed to do with our digital collections, and to do that, our search engines have to be able to tell which users can do what with which items—and that’s all free text now, and it’s a total mess. Europeana and the Digital Public Library of America are working on it, thankfully, though I don’t envy them that job one bit.

An example of the ugliness of free text that I use in class a lot is from library software developer Bill Dueber, who took a close look at what was after the ISBN in the 020 field in the catalog he was working with. It’s horrific. Just the top twenty responses by number of occurrences, you can see the inconsistency:

  • 402537 pbk
  • 387406 alk. paper
  • 99260 v # (e.g., “v. 1”, “v. 22”, etc.)
  • 82918 cloth
  • 51125 hbk
  • 42036 electronic bk
  • 41360 acid-free paper
  • 38792 hardcover
  • 28913 set
  • 20358 hardback
  • 19160 ebook
  • 16264 paper
  • 15269 u.s
  • 12770 hd.bd
  • 11793 print
  • 10625 lib. bdg
  • 10520 hc
  • 8772 est
  • 7767 pb
  • 7639 hard

The more you drill down, the worse it gets. Dueber looked for everything that might reasonably represent a hardcover book:

hard | hard adhesive | hard back | hard bd | hard book | hard bound | hard bound book | hard boundhard case | hard casehard copy | hard copy | hard copy set | hard cov | hard cover | hard covers | hard sewn | hard signed | hard-backhard-backcased | hard-bound | hard-cover | hard-cover acid-free | hardb | hard\cover | hardbach | hardback | hardback book | hardback cover | hardbackcased | hardbd | hardbk | hardbond | hardbook | hardboubd | hardbound | hardboundhardboundtion | hardc | hardcase | hardcopy | hardcopy publication | hardcov | hardcov er | hardcovcer | hardcove | hardcover | hardcover-alk. paper | hardcovercloth | hardcoverflexibound | hardcoverhardcoverwith cd | hardcoverr | hardcovers | hardcoversame | hardcoversame as above | hardcoverset | hardcovertion | hardcver | hardcvoer | hardcvr | harddback | harde | hardocover | hardover | hardpack | hardpaper | hardvocer | hardware | hd | hd bd | hd. bd | hd. bd. in slip case | hd. bd.in sl.cs | hd. bk | hd. cover | hd.bd | hd.bd. in box | hdb | hdbd | hdbk | hdbkb | hdbkhdbk | hdbnd | hdc | hdcvr | hdk | hdp | hdpk | hradback | hradcover | hrd | hrdbk | hrdcver | hrdcvr

So our catalogs can’t answer the very simple question “yo, this book, print or electronic or both?” At least not based on the 020—and yes, I know RDA fixes this and I’m pleased about that. Bottom line, though, a lot of catalog data is hopelessly internally inconsistent. Sometimes that’s material for patrons and sometimes it isn’t, but when it is, this is my call to control all the things. All of them. Anything useful in a record that isn’t actually transcribed off the item needs a controlled vocabulary, or other appropriately-standardized expression if it’s something like a date. I cannot with this nonsense, and neither can computers. “Whatevs, just type something” is not okay in 2015. Transcribe it or control it; there is no third option. Faciendum est. Oh, and since I’ve said the word “transcribe,” let me just say, intentionally transcribing typos and other errors in information that’s material to a patron’s searching and browsing is completely mindboggling to me. We have to fix that stuff, and stop propagating mistakes. Consider it a service to publishers, as well as our poor patrons.
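To show what “control all the things” means for a field like this, here’s a minimal sketch that collapses Dueber-style 020 qualifier variants onto a tiny controlled vocabulary. The variant tables are a small illustrative sample I made up from the list above; a real cleanup pass would be built from the full frequency data.

```python
# Small illustrative samples of variants; a real pass would cover the whole
# frequency list, including the truly creative misspellings.
HARDCOVER_VARIANTS = {
    "hbk", "hc", "hd", "hdbk", "hardback", "hardcover", "hard cover",
    "hd.bd", "hrdcvr", "cloth", "lib. bdg",
}
PAPERBACK_VARIANTS = {"pbk", "pb", "paper", "paperback", "softcover"}

def normalize_binding(qualifier):
    """Map a free-text 020 qualifier onto a controlled value, or None."""
    q = qualifier.strip().lower().rstrip(".")
    if q in HARDCOVER_VARIANTS:
        return "hardcover"
    if q in PAPERBACK_VARIANTS:
        return "paperback"
    return None  # unrecognized: route to a human

raw = ["pbk", "Hardcover", "hd.bd", "alk. paper"]
controlled = [normalize_binding(t) for t in raw]
```

Once the values are controlled, “print or electronic or both?” becomes a question the catalog can actually answer.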

I picked on the 020 field for another reason too, having less to do with cataloging practices and more to do with ISBNs. Now, I know I can’t fool y’all the way I fool my org-of-info students with the question “is the ISBN a good book identifier?” We know it’s not. We know lots of books don’t even have ISBNs, and sometimes ISBNs get repeated for different books. It’s not totally clear what a “book” even is in ISBNland; it’s kind of an edition but not really, it’s kind of a format question but not really, and it’s all very confusing. Perhaps predictably, it’s confusing to computers too. Computers need to be really super unambiguous when talking about what kind of thing something is—if you and the computer have a different definition of what a “book” is, the computer is going to do random unexpected and unpredictable things from your point of view. The computer is happy to use whatever definition or definitions we’re happy with, the computer doesn’t care… but in spite of FRBR and sometimes (it must be said) because of it, we don’t really have clear definitions here that don’t lead us into logical contradictions or bad edge cases.

So that’s one thing. We have to figure out what exactly we’re talking about when we say things like “book” and “ebook” and “hardback” and so on, so we can explain the distinctions clearly to the computer… and if this reminds you of Suzanne Briet trying to explain when an antelope is a document and when it isn’t, I am right there with you! It’s totally going to be weird and sometimes theoretical like that.

Once we know what kinds of things we’re talking about, we have to be able to point clearly and unambiguously at every single example of these things that we—we collectively—have, so that it’s easier to pool the information about these things that we collectively have. The network can work for us if we let it—all of us know more than any of us about what we all have—but to let the network work, we have to have a common way to point at things. And for a computer, that means an identifier that (unlike an authority string) never, ever changes. And for a networked computer, that means an identifier that’s unique not just in your organization—so no, your call number or barcode number won’t work—but unique worldwide, so it absolutely cannot be language-dependent.

We’ve tried collating our records by fuzzy-matching and deduplicating; that’s how metasearch worked. But we pretty much all know that metasearch never worked real great. Computers aren’t bright enough to fuzzy-match well, and catalog data and metadata are sparse enough that they’re not good candidates for that approach to begin with. We’ll still have to use it to assign identifiers to start with, because we don’t have anything better, but it’ll be a long haul for some stuff and some catalogs. So that means we need unique identifiers for our stuff that are way more reliable than ISBNs. If you know the linked data world at all, you know that the scheme that’s been settled on for these is URIs, which mostly look like URLs. The reasoning there is that we already know how to make URLs globally unique, because they already have to be or the web doesn’t work. That’s all it is, just keeping the computers from getting confused.

Another problem we have to solve if we’re going to take advantage of the network is taking identifiers and quasi-identifiers that we’ve been using for things, like ISBNs and authority strings, and matching them up with URIs/URLs that have been established for those things. Again, I’m not going down in the weeds here, but I do want you to know (if you don’t already) that the nerd-word for this process is “reconciliation” and it can be partially automated if you know your source data well, as catalogers generally do.
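Here’s a toy sketch of that partial automation: fuzzy-matching a local heading against a table of already-established URIs, with anything below a confidence cutoff kicked back to a cataloger. The authority table and example.org URIs are invented for illustration; real reconciliation would target a service like VIAF or id.loc.gov.

```python
import difflib

# Invented authority table for illustration only.
AUTHORITY = {
    "Twain, Mark, 1835-1910": "http://example.org/auth/twain",
    "Austen, Jane, 1775-1817": "http://example.org/auth/austen",
    "Dickens, Charles, 1812-1870": "http://example.org/auth/dickens",
}

def reconcile(local_string, cutoff=0.8):
    """Return (URI, matched heading) for the closest authority string, or None."""
    candidates = difflib.get_close_matches(
        local_string, AUTHORITY.keys(), n=1, cutoff=cutoff
    )
    if not candidates:
        return None  # no confident match: a cataloger decides
    heading = candidates[0]
    return AUTHORITY[heading], heading

# A local record with a slightly mangled heading (missing comma) still matches.
match = reconcile("Twain, Mark 1835-1910")
```

The cutoff is where knowing your source data comes in: set it too low and you mint wrong matches, too high and the humans drown in review work.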

Once you have a URI for something, you can go out to the network and ask a whole bunch of trustworthy sources “hey, what do you know about this thing?” and get back useful answers. To me, that’s how what we now think of as copy cataloging is going to work. Ask tiny questions of various reliable sources, get tiny answers, build them up into search indexes and facets and browsing tools and all the other UI chrome we’re already familiar with. It won’t have to be done by hand, if you tell a computer “every time I feed you a URI for a book we’ve bought, ask this question of that source in this way and store the answer there” it will happily do that, reliably and consistently, every single time. I strongly believe this will be a much better solution to what I think of in capital letters as the Problem Of Vendor Records. You’re familiar with this problem, I don’t have to elaborate, right? What I’m saying is, the Problem of Vendor Records is nine times out of ten a problem of vendors struggling not only with MARC and AACR2, but with MARC practices that are incredibly inconsistent across libraries. It’ll be a lot easier for us and for vendors if instead our computers ask their computers a lot of tiny questions with tiny answers.
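The “tiny questions, tiny answers” idea can be sketched like this: two simulated sources, both keyed by the same URI, whose answers get pooled into one record. In real life each source would be an HTTP request to a linked-data endpoint; the sources, URIs, and field names here are all invented for illustration.

```python
# Invented stand-ins for two trustworthy sources, keyed by a shared URI.
SOURCE_A = {  # say, a national library
    "http://example.org/book/42": {"title": "A Study of Mosaics", "language": "eng"},
}
SOURCE_B = {  # say, a subject-specialist database
    "http://example.org/book/42": {"has_bibliography": True, "subject": "Mosaics"},
}

def describe(uri):
    """Pool what every source knows about one thing, identified by its URI."""
    record = {}
    for source in (SOURCE_A, SOURCE_B):
        record.update(source.get(uri, {}))
    return record

record = describe("http://example.org/book/42")
```

Because both sources point at the thing with the same URI, merging their answers is trivial—no fuzzy matching, no guessing whether two records describe the same book.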

That leads to one last thing about the 020, okay? ISBNs aren’t unique, the field includes inconsistent format information, yeah, yeah, we got that. Here’s my question: what the everliving heck is format information doing in an ISBN field to begin with? Much less information about volumes of a multi-volume series? And if it does have to be there—and I know, I know, I do understand why it ended up there—why isn’t it at least in a separate subfield? What’s this nonsense with parentheses?

I’m showing you a mosaic detail here for a reason. You can see all the teensy-tiny individual rock bits here, and you can see how carefully they’re placed, and that none of them actually overlap anywhere. That’s what our catalog data and metadata should look like. No overlaps, nothing jammed together, everything in tiny tiny bits and each tiny bit in its own singular place, very carefully set apart from all the other tiny bits. That’s called “granularity” or “atomicity,” and computers love it. Computers are really really good at building up whole mosaics from tiny granular pieces! What they’re critically bad at is taking a whole and breaking it into parts. We have to do that for them. And as we saw with the 020 field, we often don’t, or when we do, we do it in ways that the computer finds confusing and inconsistent. Here’s a relevant factoid to take home with you: Computers cannot reliably and consistently split human names and titles in human languages into their component parts. Naming is just too inconsistent across human cultures and languages for that to work. With pre-split names, though, it’s relatively easy to write rules to put the names back together intelligibly.
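The asymmetry in that factoid is easy to demonstrate: given pre-split name parts, reassembly is a couple of trivial rules, while splitting “Dorothea Salo” versus, say, a Hungarian or Chinese name reliably is a research problem. The field names below are my own invention for illustration.

```python
# With the name already granular, rebuilding any display form is trivial.
def display_name(parts, inverted=True):
    family, given = parts["family"], parts["given"]
    if inverted:
        return f"{family}, {given}"   # catalog-style inverted heading
    return f"{given} {family}"        # natural-order display

name = {"family": "Salo", "given": "Dorothea"}
inverted = display_name(name)
direct = display_name(name, inverted=False)
```

Going the other direction—writing code that correctly splits arbitrary name strings from arbitrary cultures—is the part computers can’t do, which is why the splitting has to happen at creation time, by humans.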

So I grabbed a couple of examples from IFLA’s ISBD supplement—and again, thank you for this, if anybody’s here from IFLA, this is a classroom life-saver—just to show the difficulties:

Object (cartographic ; visual) : unmediated

Scanglobe diameter 30 cm / by Replogle. — Scale 1:42,000,000. — [Chicago, Illinois] : Replogle Globes, [2006?]

1 globe : col., plastic ; 30 cm (diam.)

Relief shown by shading and spot heights. Depths shown by shading and soundings. — “Scanglobe” is a trademark. — Mounted on spindle crowned by a plastic clockface, in a plastic meridian half circle, on plastic base. — Globe lights up from inside by means of electrical cord with switch button and interior bulb

Try to think like a computer for a second. There’s a whole lot of punctuation all up in here, and it’s not at all obvious (even to me, and I’m a human being) what’s being set off by it or what it means, or even if it means anything at all! I mean, look at the physical description of the globe there. Tell me what a period, a dot, means there. As a human being, I can figure out the dots there are calling out abbreviations. Now, can I just tell the computer to assume a dot always means an abbreviation? Of course I can’t! A dot doesn’t mean that in the other areas! And can anybody tell me why in the last area of the globe description, everything except the final sentence has a period at the end? It’s enough to make a computer programmer cry into her beer.

I lopped the area labels off for legibility, but I just want to point out, we have two competing sets of delimiters happening here—the areas, that are set off with whitespace, and what’s in each individual area, which is funky punctuation city. And when you add that to MARC, we’ve got a whole ’nother set of delimiters in the form of fields, subfields, and indicators. I respect my colleagues who teach cataloging! I could never do it! Because I cut my teeth on XML, where delimiters are totally cut-and-dried and straightforward, such that I find this mishmash completely bewildering!

A lot of my students find learning HTML and XML frustrating, because they’ve never had to be one hundred percent consistent about delimiters before. So they make tiny little mistakes like leaving off an angle bracket, and they haven’t learned to scan for those mistakes yet, and they don’t understand what the validator that does notice problems like that is trying to tell them, and it’s really frustrating for them. What I tell them is, suck it up and deal. Okay, no, I’m not actually that evil about it. I’m pretty careful to point out the kinds of errors that beginners usually make, and I tell them that everybody makes those errors, even really skilled and experienced people, and it’s okay, the whole point of validators is to help us get it right. But fundamentally, they have to learn to deal, and so do we. They don’t like that necessarily, and we don’t have to like it either, but too bad, faciendum est, it must be done. Don’t confuse the computer, folks! Reliable and consistent delimiter use is how we avoid confusing the computer. Delimit in just one way, delimit clearly, delimit unambiguously.
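Here’s why I can tell my students the validator is their friend: with one unambiguous delimiter scheme, the computer can say instantly and precisely when a delimiter is wrong—something it can never do with ISBD’s overloaded periods. A minimal sketch using Python’s standard-library XML parser:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Report whether a snippet of XML parses cleanly."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = is_well_formed("<note><p>Includes index.</p></note>")
bad = is_well_formed("<note><p>Includes index.</p></note")  # missing bracket
```

One missing angle bracket, one immediate and unambiguous “no.” That’s the payoff of delimiting in just one way.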

Even our XML-based metadata standards aren’t necessarily doing that, much less our MARC-based catalogs! MODS has composite fields and weird ISBD punctuation. I have to share this thing that broke my heart—I was digging into the history of MARC and ISBD internationally, and it turns out that the Germans were totally bent on killing ISBD punctuation out of MARC and relying on MARC delimiters only, which from the point of view of 2015 totally would have turned out to be the right decision, but English-speaking MARC went the ISBD direction instead, and just argh, 20/20 hindsight. One more mess we now have to clean up.

As a recap, here’s what we have to do: We have to get a handle on our free-text issues. When we’re saying the same thing, we need to say it the same way every time. We need to atomize our data, make it as granular as it can be. The delimiter thing, sometimes we have too many and sometimes we don’t have enough, and we need a happy medium. And when we can identify something as well as labeling it, we should, because identifiers make computers happy and useful.

Just to reiterate, we don’t have to do these things Because Linked Data. We have to do these things Because Databases, and Because Search Engines, and Because Faceted Browsing, and Because Internet, and Because Web. Basically, Because Not-Too-Bright Computers. Now, no secret, cards on the table: if we do these things, we’ll be one heck of a lot closer to having linked data. So it’s not like I’m ignoring that, I’m just saying linked data is not all this is about.

Well, that’s great. Plenty of work for catalogers, right? How do we actually do it, exactly? Quomodo faciendum est?

“By hand, one record at a time,” is not the answer here. It can’t be. I mean, yes, there will still be weird outliers that we end up fixing by hand—there always are, I could tell you stories—but we need to throw computers at fixing the easier problems and limit our handwork to those weird outliers. We have too much data to do it any other way. I’m not sure that doing it organization by organization is the way either. In my head, that means a lot of the same problems are getting solved redundantly in parallel. That costs too much and takes too long.

But the truth is, we collectively don’t yet have the know-how we need to collaborate on this. Most computer programmers don’t have the MARC knowledge, and many catalogers don’t know how to work with records en masse. So this problem? Solvitur discendo; it is solved by learning. Catalogers, developers, we all have some learning to do. I would say that, right? I’m a teacher. But I’m a teacher because I believe this.

Any managers in the room, supervisors? Give your people time and space to learn; do not make me yell at you about this! Your cataloging backlog or the new tech thing your developers are working on is far less important than your strategic preparation for what’s barrelling down the pipeline at you, okay? By all means hold your people accountable for actually learning as opposed to complaining about learning—again, I could tell you stories, but you can probably tell me stories too—but let them learn. Help them learn. Learn yourself, it won’t kill you and might make you stronger.

Here are some tools I think are well worth adding to your toolkit if they’re not there already, because they’re designed to fix stuff in lots of records at once rather than one-record-at-a-time. I’ve ordered them by the order in which I would recommend that a cataloger learn them. The last half you may not even need; I mention them because there are situations where they’re genuinely going to be useful.

  • MARCEdit
  • OpenRefine (try LODRefine)
  • Regular expressions (try regexone.com)
  • SQL (if you have to extract data from a relational database)
  • XSLT (if you have a lot of XML around)
  • Catmandu/Fix (don’t start here, Here There Be Yaks)
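
To give a taste of what regular expressions buy a cataloger, here is a sketch in Python that pulls bare ISBNs out of messy 020 $a values; the sample values are invented:

```python
import re

# Invented examples of messy 020 $a values: ISBNs with qualifiers tacked on.
raw_020s = [
    "9780262533287 (pbk. : alk. paper)",
    "0262533286 (hardcover)",
    "9780262533287",
]

# Match a bare thirteen-digit or ten-character ISBN at the start of the
# value (X allowed as the final character of a ten-digit ISBN).
isbn_re = re.compile(r"^([0-9]{13}|[0-9]{9}[0-9Xx])\b")

cleaned = [m.group(1) for v in raw_020s if (m := isbn_re.match(v))]
print(cleaned)  # ['9780262533287', '0262533286', '9780262533287']
```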

How should you learn, and what should you do with what you learn? In my head, those questions are intimately entwined. In some circles I’m known for the phrase “beating things with rocks until they work.” If you use mosaic rocks this probably won’t work, but you get the idea. I think plain old mucking around and breaking things and fixing them is the best way to learn new tools, myself—it’s certainly how I do it, and it’s the method I have my students use. Pick something to do with the tool, then do it, and if a few rocks get beaten on along the way, it’s all good.

What should you do exactly? Well, try cleaning up your catalog data! Go fix your 020s with MARCEdit, see if you can at least make those format notes more consistent. Export your data into something OpenRefine can read—it’s doable, I’ve talked to catalogers who did it—and see if you can cluster your 504s such that you can figure out in the majority of cases if there’s a bibliography, if there’s an index. Try fixing your dates, there are all kinds of fun and interesting problems you’ll run into and have to come up with some way to solve. You know your catalogs way better than I do; you know where the worst problems are. You also know where the most important problems to fix are, which I think is totally crucial knowledge here. So learn the tools by fixing the problems you already know are there.
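
The 504 clustering can be sketched in a few lines of Python; this mimics OpenRefine’s “fingerprint” clustering method (lowercase, drop punctuation, sort unique tokens) on invented sample notes:

```python
import re
from collections import defaultdict

# Invented 504 notes: three say the same thing differently, one differs.
notes_504 = [
    "Includes bibliographical references and index.",
    "Includes bibliographical references and index",
    "includes index and bibliographical references.",
    "Includes bibliographical references (p. 201-210).",
]

def fingerprint(value: str) -> str:
    # Values differing only in case, punctuation, or word order get the
    # same key, so they fall into the same cluster.
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

clusters = defaultdict(list)
for note in notes_504:
    clusters[fingerprint(note)].append(note)

# The first three notes land in one cluster; the page-range note stays
# separate for a human to review.
print([len(group) for group in clusters.values()])  # [3, 1]
```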

We’ve got a lot of work to do, and it’s got to be done, so let’s jump in and do it. Thanks very much, and I’m happy to take questions.


Quia Faciendum Est: Introduction

If there’s a harder speaking task than talking to catalogers about the imminent death of MARC (and that death is imminent), I don’t know what it is. It’s such a hard job that I found myself resorting to what for me is comfort art—Roman mosaics, which I love.

I can’t say I achieved miracles with this talk, given for the Potomac Technical Processing Librarians’ annual symposium; one attendee tried assiduously to convince me of the one true MARC way afterwards. I do think I managed to come across as non-threatening, though, which may well have been the best outcome possible.


Discovering Standards: Adoptability and Adaptability

Standards… the final frontier. These… are the voyages of the starship NADDI… its continuing mission…

… no, but seriously, thank you, Barry [Radler], and hi everybody, it’s great to meet you all. I’m Dorothea Salo, and I teach XML markup and research-data management, among other things, at the iSchool here at UW-Madison. So of course I’ve known about the Data Documentation Initiative (DDI) for a long time, and have been watching it progress and gain adoption with great interest and delight.

Designing a markup language is hard! I know this because I’ve done it. Getting anybody to use a markup language is even harder! So I hope you are all proud of what DDI’s designers and user community have accomplished. I am certainly proud to stand here before all of you.

The tagline for NADDI 2015 is “Enhancing discoverability with open metadata standards.” I have to say, this is probably not everybody’s cup of chai. But if you’re here in this room, you’ve drunk the chai already: you truly believe that the standard life is a good life.

So do I! I believe! I believe in standards! If I didn’t believe in standards I wouldn’t have gone to an information school. If I didn’t believe in standards I wouldn’t have been interested in librarianship. If I didn’t believe in standards I certainly wouldn’t teach in an information school!

I love standards. I know about lots of them, and like anyone, I have my favorites and my unfavorites. What with all the hullaballoo about linked data, you might think that RDF is one of my favorite standards, but actually it’s my chief unfavorite. I work with RDF, I teach and train on it, I even give talks about it now and then, but I’m not a gigantic fan of it. I work with it and teach it because I have to; people need to know about it. I am, however, a cautious fan of schema.org microdata. I’ll be mentioning it again later. And just so you know, schema.org microdata has nothing whatever to do with what the DDI community typically calls microdata—so, yes, just the vocabulary in the standards landscape is a mess.

And now some of you are giving me side-eye with a “what in the world is wrong with this woman?! Standards are all fine and good, but moderation in all things!” Hey, I said librarians loved standards. I wasn’t kidding! Don’t try this level of dedication to standards at home; go to the library instead, okay? The point is, there’s lots of standards out there. So many standards! You almost have to be a librarian to love the standards universe, right?

No, but seriously, as I thought about what I wanted to say to you all today, I decided it was important to point out the very crowded and confusing standards and markup-languages space, not to mention the even more crowded and confusing best-practices space opening up around research-data management. I decided to hack the conference tagline, a little bit. (I do this. I hack things. You want to see my latest Mad Information Science hacking efforts, come on up to the iSchool library on the fourth floor of Helen C. White Hall, and I’ll show you my media-archaeology machine.) Instead of enhancing discovery with open metadata standards—which is a fine and worthy and very librarianly goal, don’t get me wrong—I decided to talk about enhancing discoverability of open metadata standards.

Why did I do that? Because in my head, that is the final frontier, the discovery and exploration frontier for standards just like DDI. In a crowded, confusing, competitive standards landscape, how does anybody get a standard noticed? How do you get it adopted? How do people whose problems your standard can solve discover your standard? How do they decide to adopt it, and how do you explain to them why they should adopt it in the first place? How does your standard, your one tiny galaxy in the giant universe, fit into the rest of their universe? That turns out to be a crucially important question these days, as it happens, because there is no One Ring—I mean, One Standard—there is no One Standard to Rule Them All.

I’ll say this again, because it’s important. There is no one universal data or metadata standard. There never will be. There never should be. With all the million different things we create data from, and do with data, and need from data, there’s just no way to create a single comprehensive standard that makes sense for every imaginable kind of data and data use case. Sorry, but not even DDI.

That means that inevitably—seriously, there’s no getting around this—DDI has no choice but to do two things. DDI has to compete for mindspace and adoption against other standards, not to mention non-standard technologies like Microsoft Excel, which is of course one of the horrors that DDI was designed to prevent. This competition takes place in what I already showed you is a huge, complicated, and confusing space. Secondly, DDI also has to fit itself into a universe where people will be using other standards and non-standard technologies alongside it, and they’d ideally like that to be easy for them.

DDI isn’t alone. This community is not alone in facing these challenges! In libraries, we are struggling with exactly the same thing right now. We honestly thought back in the 1960s and 1970s that we’d created a One Ring, a One Standard that would rule them all—or at least describe them all, everything, everything a library might collect. We called it MARC, Machine Readable Cataloging, and a brilliant programmer and systems analyst named Henriette Avram designed it in the 1960s, about the same time as XML’s precursor SGML and relational databases were coming into existence.

MARC was designed so that computers could hold, share, and print out the kind of metadata you find on a card in a library card catalog: author, title, subject, call number, copyright date, physical item description, and so forth. It stored metadata for books and also for other things libraries collect: maps, music scores, journal titles, and so forth.

By the way, you can impress your friends at parties with how long libraries have been standardizing stuff: the card catalog spread through American libraries in the mid-1800s, and card size and catalog size were standardized by the American Library Association in 1876, with Melvil Dewey among the prime movers. As a standardista, I love libraries, I really do—you have to love it when practically the first act of a brand-new professional organization is to set a standard!

This goes to show how durable a useful standard can be. The physical card catalog survived as a standardized technology well into the 1990s, after all, over a century of use. And for half a century now, MARC has been librarianship’s freight train, our rail gauge, our standards heavy hitter. I can’t begin to explain to you the importance of this standard in librarianship globally. So what’s the problem with MARC? As I just said, MARC was designed for printing catalog cards. That means that we librarians designed our computerized record structure around a human-readable data format.

DDI didn’t actually do this. DDI isn’t intended for humans to look at directly. And that’s good! That was the right design decision to make! I want you to understand why it’s good, though, because it’s something that you may well have to explain to potential DDI adopters who expect something more human-friendly than raw DDI is.

We couldn’t really have known this at the time in libraries, but it turned out that basing our data structure for computers on something meant to be human-readable was putting the cart before the horse. This perfectly understandable and reasonable decision, the decision to build a standard around human-readability, actually hurt libraries and librarians in the long run.

One reason is that designing around the card catalog, which was then totally the ultimate in human-readable data display, meant serious problems when the ultimate in data display changed on us by going digital! There are eighteen long stories here that I’m passing over in silence, but the practical upshot is that catalog cards didn’t translate well to web pages, never mind web search engines like Google. Worse, the human-readable catalog-card format has turned out to be ridiculously hard to program against for computerized indexing and search.

DDI didn’t do this. DDI’s design is based on the structures inherent in the data, without making assumptions about how humans would want to see or manipulate it. And that was exceptionally wise, because humans don’t always want to see or use data the same way.

Another problem I’ll mention with MARC has to do with what I said earlier about there not being a single standard that handles every single use with equal ease and effectiveness. MARC tried to be that standard, for libraries. It failed, and we’ve been dealing with that failure for decades. Talk to any music cataloger! MARC was designed for books, not sheet music, and there are some key differences that it just doesn’t respect. Or talk to anybody who deals with CDs or DVDs or other multimedia. Just forget it, MARC’s terrible for that stuff. Even at the time, MARC was a poor fit for some of the existing library environment—librarians were just so laser-focused on books that they didn’t take the rest of library collections seriously enough.

I encourage the DDI community to look seriously at its edge cases, maybe even publicize them. Where is DDI being used in unexpected contexts? Is it a good fit? If it isn’t, could it be, or is the problem truly out of scope? Where is DDI being “misused,” and are any of the so-called misuses interesting enough to become real use cases? Sometimes the way standards achieve broad adoption is by paying closer attention to problems that weren’t in the original scope. Just a suggestion.

Returning to MARC, the really, really big reason that modeling MARC on human-readable catalog cards was a huge mistake for libraries has to do with data consistency, or more properly, lack thereof. Humans can usually—not always, but usually—read past inconsistency, or ignore it when it’s not important. You or I might chuckle or frown if we wound up driving behind a car with mismatched taillights in traffic, but we probably wouldn’t crash our car into it, right? Because inconsistent taillight coverings don’t matter to us really. We look right past them.

True story, one day when I was a new librarian, I accidentally wore one black shoe and one navy-blue shoe in the same style to work, and I was completely mortified once I noticed, but absolutely nobody else even saw it. This is an amazing, brilliant human skill, this ability to cope with inconsistency. We’re also, as a species, absolutely top-notch at dealing with this in text—most of us handle abbreviations, misspellings, smartphone autocorrect errors, no problem!

But this amazing human skill of tolerating inconsistency makes a mess of data structures, especially when computers enter the picture. While MARC was being designed, nobody cared whether its standards and practices were completely consistent. Formal consistency wasn’t even considered worth shooting for, because who would even notice really if it wasn’t there? Just like nobody noticed my shoes that day. So there are lots of places in library cataloging standards where the instructions just shrug and say “meh, put whatever you want, as long as people can understand it.” I kid you not, the standards say “grab a fortune cookie and write down what it says, meh, whatever, it doesn’t matter.” And I see some of you cringing, because you know the kinds of data analysis and data reuse problems that leads to—it’s part of why DDI exists, right?—and so do I, it’s just that in the 1960s nobody knew that yet… except maybe E.F. Codd, but relational databases were still being invented at the time, so never mind.

The real-world consequence of that decision has been that libraries are completely dependent on expensive, lousy, backward computer systems to run our operations. We’re stuck! MARC locked us out of using off-the-shelf or open-source software for the most part, partly because none of it was designed to read or write MARC—seriously, who even knows about MARC except librarians?—and partly because writing code to handle records that are inconsistent, because the rules never told anybody to be consistent in the first place, is a computer programmer’s purgatory! It’s not easy, it’s not fun—that’s an understatement—so the open-source community waves it on past, and libraries aren’t a big enough market to attract much for-profit programming effort. DDI isn’t a really big market either, though I do know about Colectica and I’m glad it exists, but seriously, the more fun you can make working with DDI data, the more software the community will have. Consistent data is easy and fun to work with. Inconsistent data is not.

It gets worse. After MARC was standardized, library catalogers wanted to make catalog cards better for the people who used libraries, and when online catalogs came around, they wanted to make those work better too, but the only tool they had to make changes with was how they built their MARC records. So this completely praiseworthy “users first!” ethos among library catalogers meant that they dinked around with the structure and content of MARC records in inconsistent and computer-unfriendly ways.

This led, as you’d expect, to all kinds of inconsistency across records even just in a single catalog in a single library! As for records across all the MARC-using libraries in the world, just forget it—there are heinous amounts of inconsistency there, all in the name of making life easier for people. I sure hope this isn’t happening with DDI. What we didn’t know in 1960 but know really well now is that computers just cannot read right past inconsistency the way humans do. Generally they break. When they don’t break, it takes absolutely heroic programming effort to get them past the inconsistency. This, of course, is a major reason humans invent and use standards like MARC and DDI to begin with! Standards help design and enforce a degree of consistency that an unaided human being is generally not capable of and certainly won’t produce spontaneously.

Now, as we’ve seen with MARC, a standard is not an ironclad guarantee of consistency; HTML is another great example of this. People abuse standards, they don’t learn them well, sometimes they even insist on loosening standards up because they don’t want the validator yelling at them any more. By and large, though, the last best hope for consistency—anybody see what I did there? Babylon 5, getting all the geek jokes in today—the last best hope for consistency is still some kind of standard. But strictness, enforcement of consistency, comes at a cost. And in talking with researchers, and graduate students who are learning to become researchers, I’ve found it’s a cost that especially hurts at the standards-discovery stage. For a standard, the discovery stage is when people who don’t already use a standard on their data, but have that nagging uneasy sense that maybe they should, search the huge, complicated, confusing standards universe to try to discover the standard that they should be using.

The first question that someone in the middle of the standards-discovery process asks when they spot a likely standard is “Can I do this? Can I work with this?” Of course they ask other questions, but the first question, every single time, is a total gut-check: can I do this? Strict enforcement of consistency makes standards harder to use, harder to experiment with, easier to mess up. Strict enforcement of consistency makes it a lot more likely that a standards-discoverer’s answer to the gut-check “can I do this?” question will be “nope, this is way out of my league, moving on now!” What I’ve found in my standards-building and standards-using life is that if the answer to that gut-check is “no,” honestly the only way a standard ever grabs that potential user back is by making them use it, which means a journal requirement or a funder mandate or a repository mandate or whatever.

The DDI community knows about this; the Inter-university Consortium for Political and Social Research (ICPSR) is DDI’s current enforcer. ICPSR has done a great job in that role, but it’d be nice if DDI had carrots as well as sticks, right? Not every social scientist engages with ICPSR, either.

The other way to encourage standards use is by making the standard use invisible by baking it into a tool, sort of like Colectica has tried to do. The problem with that is that people are persnickety about their tools. Not everybody will use the same tool if they’re not forced to. So we’re left with people looking at a standard that’s new to them and saying “I can’t use my favorite tool with this standard?! Well, forget this standard then!”

From your point of view, you want people with social-science data from surveys and interviews and the like to choose DDI, right? And you want people who need to understand or reuse that data to see that it’s in DDI and cheer, because they know they can figure out how to do what they need to do with it, right? So that’s two audiences of standards discoverers that DDI has to court, people who make social science data and people who use social science data.

So this tension between a standard that makes consistent computer-friendly data, and a standard that human beings can figure out how to use, is really important for the DDI community, an important cost to mitigate if you can. You want standards discoverers to encounter DDI and say “yes, I can do this!”

In libraries, we’re trying to figure this one out too. We pretty much know it’s time for MARC to go out to pasture. And we know this partly because MARC, in addition to making it harder and more expensive to run library systems, has been a serious barrier to getting everyone else in the world, from library vendors to programmer hobbyists, to work comfortably with what libraries know about what libraries have! I mean, I went over to Wikipedia’s article on MARC for a quick check on something and had to stop to laugh at the top cleanup note! If you can’t read it from where you are, it says “This article may be too technical for most readers to understand, blah blah fix it.”

Now look. When Wikipedia says “most readers” it really means “most Wikipedians,” and Wikipedians tend heavily toward the computer-nerdy. If computer nerds can’t figure MARC out, MARC has a pretty serious comprehensibility problem. So for this reason, and for the horrific inconsistency across the universe of MARC records that makes dealing with them via computer so difficult and frustrating, MARC’s got to go.

It’s looking pretty likely that the successor standards to MARC will be based on a technology called “linked data.” You may or may not have heard of linked data—I know DDI is currently working on three linked-data vocabularies, but it looks to me like it’s still early days for those—but look, honestly, it doesn’t matter if you haven’t. The point is, librarians are hunting a way forward through standards discovery. A lot of us are looking at linked data for the very first time, and let’s just say it’s not going as well as it might.

A lot of librarians have looked at linked data, done the “can I do this” gut-check, and had the answer be “oh my gosh, get me out of here, what even is this? I can’t with this!” So far, linked data has totally failed the gut-check test among librarians. It ain’t pretty, let me tell you: bone folders at ten paces, people. So I’m going to ask you all this, and you don’t have to answer me except in your heart. How often has DDI failed the gut-check test among social scientists? How many of your colleagues have taken one look at DDI and said “oh heck naw, are you kidding me?” If the number is as high as I suspect it is, what can the DDI community do about that? Library linked data, speaking sociologically, is a total mess, I can’t even begin to tell you. I don’t want the same for DDI. You don’t want the same for DDI.

Because no lie, I am a DDI fan, because I’m a digital preservationist—that’s another thing I teach—and I know what’ll happen to a lot of social-science datasets that should be in DDI but aren’t. They’ll glitch, like this image on the screen, and then they’ll die. That information will be unrecoverably lost. Ain’t nobody want that. I have another dog in this hunt too, and that’s this: the social science community, by and large, is light-years ahead of the rest of research when it comes to taking proper care of data. I really want other disciplines to learn from you people, because that’ll make my life as a digital research-data preservationist easier! But that brings up a consistency thing again. If even social scientists can’t converge on a standard as useful as DDI, how useful are social scientists as a model? So I need DDI to pass the gut-check test.

So let me close by making some suggestions, as an outsider to the DDI community who is nonetheless invested in DDI’s success, about how DDI might pass more gut checks, become more discoverable, more adoptable, and more adaptable.

After I stopped laughing at MARC’s Wikipedia page, just for the heck of it I looked up DDI’s. It does have one, and that’s great, that’s totally step one. But it’s got a blah-blah-fix-it note up too, this time about uncited information. Like it or not, Wikipedia is a place a lot of people go for that “can I do this?” gut check. Blah-blah-fix-it notes do not inspire confidence in these people. I really recommend a community Wikipedia hackathon day or whatever to fix this. One thing you may well find is that some or all of the uncited information in the Wikipedia page here doesn’t actually have an available, citable online source. That’s a problem! That’s a documentation problem for DDI! If there’s information basic enough to be in the Wikipedia entry, you absolutely want to ensure it’s in DDI’s website and documentation also.

So here’s DDI’s home page, the other likely place for that gut-check question.
And I love y’all, I really do, but tough love here: this page is not good. This page seems absolutely designed to make standards discoverers run screaming in the opposite direction. Just as a minor example, the first information after the navigation bar is the last-updated date. Nobody’s coming to this page looking for that; put it in the page footer where it belongs.

And then there’s the research-data lifecycle diagram, and look, I know lifecycle models were trendy in 2009 or so, but my experience is that they’re terrible communication tools. Nobody understands these things; they’re too vague and abstract for people who do research to see themselves and their workflows in. This one specifically, it’s not clear why DDI is at the center of the picture, or why it’s in this weird gear thing, and the picture doesn’t make clear what DDI actually does or how it helps with all the things in the blue boxes. Ditch this thing. Seriously, just dump it. It’s not helping DDI’s adoptability among social scientists.

What I might do instead, and this is only a suggestion, is to explain clearly what kinds of research and research data DDI works with. This is a quick list off the top of my head; you could probably do better:

  • surveys
  • interviews
  • codebooks
  • microdata

The point is, a researcher who doesn’t use DDI will come to this page, see that list, and if they make the kind of data that DDI is good for, they’ll immediately recognize that, which they can’t from this lifecycle diagram.

Then there’s DDI’s tagline, “a metadata specification for the social and behavioral sciences.” Two things about this. One, give me an estimate here, how many social and behavioral scientists have a sense of what “metadata” even means? I mean, it’s probably higher than some other disciplines, but in my experience, lots and lots of researchers bounce right off the word “metadata,” and its negative connotations due to our friends at the NSA probably don’t help much.

Two, take it from a librarian, DDI is not just a metadata specification! It contains metadata, sure, codebooks are metadata and instrument descriptions are metadata, but DDI is also a content and data specification! You don’t just describe your interview instruments or your survey methodology with DDI, you can also put the actual interview transcripts or survey results in DDI. This seems like a persnickety objection, and I won’t lie, it is! Putting my librarian hat on, though, librarians hold pretty strictly to the distinction between content and metadata. As some of you learned yesterday from my fellow librarians Brianna Marshall and Trisha Adamus and Kristin Briney, librarians are helping guide standards discoverers to standards these days, and if your home page misleads librarians about what the DDI standard actually does, I do think it’s a problem.

This is nitpicky, but exactly how many DDI specifications are there? To somebody trying for that gut check, hearing from the tagline that DDI is one specification, and then seeing a couple inches down that it’s more than one, is worrisome. It’s like a mini-bait-and-switch, like you’re trying to make DDI seem easier than it is.

Last thing: This page doesn’t even try to answer the gut-checker’s first question: can I do this? Heck if I can tell from this page. And no, nobody wants to start from the documentation, especially if it’s called that. Documentation is what you give your grad students so they can get on with it and you can ignore it, right? Y’all need a getting-started-with-DDI page here in the worst way.

Stepping back from specifics, there’s a thought-pattern that I want to encourage the DDI community to use: It’s not about what you can do with DDI, it’s about what I can do with DDI. No joke, I really mean this, DDI will not succeed or fail based on what you here in this room can do with DDI. You wouldn’t be here if you weren’t already knowledgeable, okay?! So DDI adoption is not about you. It’s about me, it’s about what I can do with DDI as a community outsider. Look, DDI wants me, because I train people who you hope will be community insiders someday! And I’m not the only outsider you care about, either. It’s students. It’s librarians helping people preserve data and find datasets. It’s journalists looking for stories, stories that might be lurking in your data. It’s web developers looking for interesting data to mash up. And bringing it back to the conference theme, it’s definitely about search engines. So much. So much about what web search engines can do with DDI!

As anyone who works toward web usability knows, the way you figure out what people think when they do that gut-check with your standard is to ask them. Hey, check out the DDI web page, do you get what DDI is now? No? Okay, what don’t you understand? And you revise your page from there. But that can be hard and time-consuming to do for every population of outsiders you’re interested in, and I actually think there’s a short-cut: ask educators, ask people who teach about DDI. Ask people who teach DDI—Jane Fry, are you here? Ask Jane Fry! We educators see people’s first encounters with new standards all the time. We can totally tell you what trips people up! It’s what I’ve just been doing, right?

Maybe you don’t believe me about outsiders, so I’m going to show you something. This is a question-and-answer website called Open Data Stack Exchange, which is where people who are interested in open data ask and answer each other’s questions. A ton of questions on this site revolve around social-science data, mostly where to find it. It’s kind of hilarious, the kinds of data people just assume somebody has; I really want to ask half the questioners on this site why on earth they think the dataset they want exists, but look, that’s not the point. The point is there are a lot of potential DDI users here, from both sides of the pipeline—data creators and data users. Do they know about DDI? Not from Stack Exchange presently; nothing comes up if you search on “DDI.” DDI is not part of this universe. As I keep repeating, even if somebody points them to it, when they ask themselves that gut-check question, “Can I do this? Can I do something with DDI? Is there something in DDI for me?” DDI really needs the answer to be “yes.” Right now it’s not. So what can DDI do to fit better into new users’ environments?

I think part of the answer for DDI, just as it is for libraries, is “fitting into the World Wide Web better.” And that’s why I bring up microdata, as I promised I would earlier. Again, sorry, terminology problem, your definition of microdata is the first one Wordnik gives, “data concerning individuals in a trial, survey, et cetera,” but that isn’t actually what I mean today. I mean the second definition, “data stored in a microformat.” This is a completely useless definition, of course, so let’s look up microformat: “a simple data format that can be embedded in a webpage.” Aha! Now we’re on to something, something that might help DDI fit better into the larger web.

Where you go to find out about web page microdata is a website called schema.org. They give an even better definition of microdata: “schemas webmasters can use to mark up HTML pages in ways recognized by major search providers.” And they go on to say that all the major search engines use microdata to improve how their search results look. Now, who wouldn’t kill to have Google actually understand what a DDI dataset is—just understanding that it’s a dataset would be a lot all by itself! If Google actually helped people find DDI datasets, and gave them an idea of what they’re looking at? Wouldn’t that be great? That is what microdata can do for DDI.

But does microdata understand what a dataset is, you ask? Why yes, yes it does! In a limited way, I grant you—you won’t be able to pack all your metadata into the web page for your project—but enough so that Google search results, before anybody even clicks on one, can say “this is a dataset about midlife, it’s called MIDUS, it’s by these researchers, and it’s published by the UW Institute on Aging” and so on and so forth. Even better, microdata understands that datasets often come in catalogs, so if you have a project portal, you can totally tell Google that it’s a project portal with a whole bunch of datasets in it!
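To make that concrete, here’s a minimal sketch of what schema.org Dataset microdata looks like embedded in a project page. The markup pattern (itemscope, itemtype, itemprop) is real microdata syntax, but the names and descriptions in it are invented for illustration, not actual MIDUS metadata:

```html
<!-- Minimal sketch: a dataset page marked up with schema.org microdata.
     Property values here are illustrative placeholders, not real metadata. -->
<div itemscope itemtype="https://schema.org/Dataset">
  <h1 itemprop="name">MIDUS: Midlife in the United States</h1>
  <p itemprop="description">A longitudinal study of health and well-being in midlife.</p>
  <span itemprop="creator" itemscope itemtype="https://schema.org/Person">
    <span itemprop="name">Jane Researcher</span>
  </span>
  <span itemprop="publisher" itemscope itemtype="https://schema.org/Organization">
    <span itemprop="name">UW Institute on Aging</span>
  </span>
  <!-- This is how a dataset points back at the catalog or portal it lives in. -->
  <link itemprop="includedInDataCatalog" href="https://example.org/project-portal"/>
</div>
```

Notice that nothing here replaces your real metadata; it’s a thin layer on top of the web page that search engines can read, which is exactly the point.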

Coming full circle here, microdata is how I think DDI and DDI datasets should be leveraging their metadata to enhance their own discoverability. And even better, I think this will even help with the gut-check question from potential DDI users. If DDI makes it super-easy to create microdata for a project web page or portal, maybe through an XSLT stylesheet or an HTML-plus-microdata template or building it into existing DDI tools or whatever, I really think “it’ll be way easier for people to Google my dataset” is a pretty compelling statement of DDI’s worth.
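Just to gesture at how little machinery the “HTML-plus-microdata template” idea needs, here’s a tiny sketch in Python. The field names (name, description, publisher) are hypothetical stand-ins for values you’d pull out of a DDI record—this isn’t a real DDI mapping, just the shape of the thing:

```python
from html import escape

# Minimal sketch of an HTML-plus-microdata template for a dataset
# landing page. The field names are hypothetical stand-ins for values
# extracted from a DDI record, not an official DDI-to-schema.org mapping.
TEMPLATE = """\
<div itemscope itemtype="https://schema.org/Dataset">
  <h1 itemprop="name">{name}</h1>
  <p itemprop="description">{description}</p>
  <span itemprop="publisher" itemscope itemtype="https://schema.org/Organization">
    <span itemprop="name">{publisher}</span>
  </span>
</div>
"""

def dataset_microdata(record: dict) -> str:
    """Render a schema.org Dataset microdata snippet from a metadata dict."""
    # Escape values so titles containing <, >, or & stay valid HTML.
    return TEMPLATE.format(**{key: escape(value) for key, value in record.items()})

snippet = dataset_microdata({
    "name": "Example Midlife Survey",
    "description": "An invented dataset used only to show the markup shape.",
    "publisher": "Example Data Archive",
})
print(snippet)
```

A stylesheet or a DDI tool could do the same transformation from the real metadata it already has, which is what would make “Google my dataset” nearly free for depositors.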

So as DDI experiments with linked data and other possible serializations and representations for social-science data—and I know you’re doing that this very afternoon!—I encourage you to put microdata on the list. See what you can do with it. Let’s show standards discoverers what DDI is good for!

Thanks for sticking with me through all that; I hope some of it’s helpful. If you’d like to get in touch with me or you’re curious about what I do, my contact information’s on the slide there. Have a great day here in Madison, and long live DDI!
