Thinking through RA21

“Recommended Practices for Improved Access to Institutionally-Provided Information Resources,” colloquially known as “RA21,” is open for public comment until the 17th of May. I definitely plan to comment, but this post is not that comment—this post is the plonking, pedantic process of figuring out and refining what I actually think so I can make a comment that is at least slightly more cogent than “ARGH, NO, CAN WE PLEASE NOT” or “Most mendaciously misleading specification title ever, y/n?”

You may have gathered from the above that I am not a fan of this spec. Quite right. I’m not. You know, cards on the table and all that.

One reason I am not a fan is the authorship list (see page 2 of the spec PDF), which is a bunch of big-pig publishers with a sprinkling of IT people and… (as best I can tell) exactly zero librarians, LIS faculty, or library or technology ethicists. (Peter Brantley works in a library but is not a librarian. Usually I wouldn’t care, but in this specific situation, I think it matters. I also think it matters that Brantley is a longtime participant in a project that has signally failed to get to grips with the staringly obvious truth that a tool that forces comments onto every website ever is a living invitation to harassment and abuse. Not, in my book, trustworthy on privacy or safety—and I can’t help wondering whether this is precisely why he landed on the authorship group.) How am I not supposed to suspect that the big pigs are pulling a fast one on privacy here? Especially given that the big pigs are presently a miasma of surveillance-capitalism FOMO?

(Seriously, this whole thing is giving me flashbacks to back in the day when the NIH held hearings on a possible open-access policy and only invited the big pigs. That was ridiculously regulatory-captured—though fortunately the NIH got over that—and so is this. Does NISO not notice when it’s being used as a patsy and a figleaf?)

So. One obvious threat model here: the big pigs are trying to use RA21 to accumulate saleable (or otherwise exploitable, perhaps via Facebook-style microtargeted advertising) data about individual library patrons’ reading, data they often can’t get now because proxy servers and IP ranges are in the way. Let’s take as read that aggregate use data is either not saleable/exploitable at all, or not nearly as lucrative to sell/exploit as use data where individuals are identified or reidentifiable.

Another threat model that springs to mind: individualized and/or reidentifiable surveillance of patron information use, likely in the name of “learning analytics” or “assessment.” The big pigs could want to do this. Institutions could want to do this. Unholy alliances between big pigs and institutions (e.g. the sordid roiling cauldron of Arizona State plus Cengage, or the blithely horrifying Unizin Consortium) could want to do this.

Quick reminder, academic librarians: ACRL says quite clearly and distinctly that “The privacy of library users is and must be inviolable. Policies should be in place that maintain confidentiality of library borrowing records and of other information relating to personal use of library information and services.” (Another of my long-brewing rants has to do with “confidentiality” as a word librarians doing Seriously Shady Stuff hide behind. Rant for another day, though.)

Um. Possibly I should back up and explain what RA21 even is first? Yes. I should do that.

RA21 and single sign-on (SSO)

So, I explained before how libraries deploy IP range recognition and proxy servers to get Dr. Scientist through publisher paywalls. RA21 wants to replace this constellation of systems—which, to be fair, is decidedly kludgy and maintenance-intensive—with SSO, which is fairly (though not universally) common higher-ed IT infrastructure. If you work and/or study in higher ed, chances are good that whenever you use your institution’s login credentials, there’s an SSO server authenticating you and at least partially authorizing you to do things with, on, or to institutional systems.

The central ideas fueling SSO aren’t too hard to get one’s head around. Organizations that run an SSO server to authenticate their students and staff are “Identity Providers” or “IdPs.” An IdP knows and can share various facts about the people represented in it; these facts are “attributes,” or taken as a whole “metadata.” Then there are services that need to ask IdPs about people in the organization, often though not solely to authenticate them and authorize service use; these are “Service Providers” or “SPs.” SPs may be internal or external to the organization—that is, a campus system as well as a big-pig publisher could rely on the campus’s SSO server for authentication and authorization. For probably-obvious reasons, though, internal SPs tend to have broader access to more attributes than external SPs. Clear so far? Cool.
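To make the IdP/SP relationship concrete, here’s a minimal sketch—not any real IdP’s configuration, with invented attribute values—of how an IdP might release different attribute sets depending on whether the asking SP is internal or external:

```python
# Hypothetical sketch of IdP attribute release. Attribute names are real
# eduPerson/SAML conventions; the values and policy are invented.

ATTRIBUTES = {
    "eduPersonPrincipalName": "dscientist@example.edu",   # personal identifier
    "eduPersonScopedAffiliation": "faculty@example.edu",  # role, not identity
    "persistent-id": "a1b2c3d4",                          # pseudonymous, unchanging
}

RELEASE_POLICY = {
    "internal": {"eduPersonPrincipalName", "eduPersonScopedAffiliation",
                 "persistent-id"},
    "external": {"eduPersonScopedAffiliation", "persistent-id"},
}

def attributes_for(sp_kind):
    """Return only the attributes this class of SP is allowed to see."""
    allowed = RELEASE_POLICY[sp_kind]
    return {k: v for k, v in ATTRIBUTES.items() if k in allowed}
```

The whole privacy question, as we’ll see, comes down to who gets to edit that release policy, and on whose say-so.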

What attributes do SSO servers typically store and share on people? This varies by IdP, and can be shockingly hard to find good information about. (One of the reasons this post has taken so long is precisely the time involved in trying to back-of-the-envelope calculate a baseline for attributes across IdPs. I’m posting now partly because of NISO’s RA21 feedback deadline and partly because I’ve given up on using OSINT to figure out how IdPs handle attributes. It’ll take somebody like Educause running a survey.) Dear campus-IT units everywhere: be transparent about this, would you please?

Brown University has the clearest, completest attribute rundown I’ve yet seen. (Dear campus-IT units everywhere: imitate Brown, please.) A few things to notice here:

  • Several attributes are actual personal identifiers. Others constitute personally-identifiable information. Still more are indirect identifiers—combined, they could pick someone out of the pack even without an actual personal identifier. (Has anyone run a study on how k-anonymous the entire list of someone’s eduPersonEntitlements or isMemberOfs is? I’d love to know.)
  • External SPs—such as our big-pig publishers—can request that Brown hand them at least one personal identifier (eduPersonPrincipalName, a.k.a. “eppn”).
  • The exact circumstances or process under which Brown would grant an external SP access to this identifier are not stated. (I looked but didn’t find anything. I may admittedly have missed something, or the documentation for this may be on a Brown intranet.) If there’s a generic recommended process and data-governance scheme for this, I haven’t found that either.
  • The supposedly-anonymized personal identifier (persistent-id) is unchanging, which allows an external SP to build up enough of a behavior-trail or web-tracker dossier over time to reidentify someone. (John Mark Ockerbloom has already pointed this out to NISO. Yay! That means I don’t have to!)

So could my RA21 threat models work?

Going by the RA21 spec, how could big pigs pile up a saleable/exploitable dataset of individuals and their information use? Well, trackers on their websites, of course, but that’s out of scope for RA21 so let’s leave it aside. Could they do it if they con institutions into helping? Perhaps in the name of the Library Value Agenda or similar?

On page 8, the fourth desideratum for RA21 reads: “The system will achieve end-to-end traceability, providing a robust, widely adopted mechanism for detecting fraud that occurs at institutions, vendor systems, and publishing platforms.” Unless the sole imagined fraud prevention method here is cutting off Dr. Scientist’s unpermissioned literature text-mining project without any further attempt to locate or punish her—which I would be okay with, actually—that suggests that on some level and for some time period, patrons’ information use must be traceable back to them by someone. On what level, for what time period, and by whom? And does RA21 allow the information necessary for fraud prevention to be (mis)used for other reasons, say, Big Pigs wanting to sell or exploit it? Who knows? RA21 doesn’t say. It damned well should.

The real meat is in section 2.3 starting on page 14. Line 1 column 2 of the table (minus the headers, you persnickety people) reads “Unless the SP has a specific, contractual agreement with an IdP, the IdP should only send anonymous and pseudonymous identifiers to the SP.” There. There. There is the incredibly giant loophole the big pigs will just waltz right on through. All SPs have to do to snoop individual patrons’ information use is convince IdPs to let them.

We trust IdPs implicitly… why exactly? Dear NISO: It’s called “insider threat.” Look it up. Then add it to your hilariously incomplete security/privacy threat model.

If I had better vidding skills than I do, I’d voiceover a scene from Disney’s Little Mermaid:

Big Pig: Come in. Come in, my librarian. We mustn’t lurk in doorways—it’s rude. One might question your upbringing… Now, then. You’re here because you have a thing for assessment. This, er, learning analytics doohickey. Not that I blame you—it is quite a temptation, isn’t it? Well, awed vocationist, the solution to your problem is simple. The only way to get what you want—is to send me an identifier attribute and ask me to collect the identified data for you.

Librarian: You can do that?!

Big Pig: My dear, sweet librarian. That’s what I do—it’s what I live for. To help underappreciated librarians like yourself. Poor souls with no one else to turn to.

I’ll spare you the actual filk of the song. (If you want to do it, though, go right ahead!)

It gets worse. Proxy servers and IP ranges are generally handled by the library, often with an assist from campus IT. SSO servers, however, are the domain of campus IT, with zero input whatsoever from the library. Campus IT does not subscribe to library ethics. Campus IT has different pressures on it than the library—most provosts, chancellors, and deans know better than to openly force the library to compromise privacy; the same is not true of campus IT (which often enough doesn’t give a crap about privacy anyway unless forced to). So the Big Pigs can probably have whatever identifiable data they want just by hinting to campus IT or campus administrators that it’d be useful assessment data to have, and wouldn’t you darling people just love to request it from us? All you need to do is send us the identifier attribute that makes it possible!

Not, frankly, that many academic libraries would resist temptation all that much better.

RA21 must put a stop to this scenario if it is to conform to library privacy norms. IdPs must only send identifier attributes in cases of suspected fraud, and when the case is closed, they must stop sending those attributes. SPs must only request identifier attributes in cases of suspected fraud, and when the case is closed, they must cease the request and discard all identified data. Furthermore, IdPs must never grant SPs access to attributes without the library’s explicit approval, and the library may only ever grant such approval for identifier attributes (or indirect identifiers in excess of what is absolutely necessary) in cases of suspected fraud.
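In pseudocode, the release rule I’m proposing is small enough to fit on a napkin. This is a sketch under my own assumptions—the fraud-case registry and function names are invented, not anything in the RA21 spec:

```python
# Hypothetical attribute-release gate: identifier attributes flow only
# while a suspected-fraud case is open, and never without the library's
# explicit approval.

IDENTIFIER_ATTRIBUTES = {"eduPersonPrincipalName", "mail", "displayName"}

open_fraud_cases = set()  # persistent-ids currently under investigation

def may_release(attribute, persistent_id, library_approved):
    if not library_approved:
        return False  # no release of anything without library sign-off
    if attribute not in IDENTIFIER_ATTRIBUTES:
        return True
    # Identifier attributes: only for an open suspected-fraud case.
    return persistent_id in open_fraud_cases
```

When the case closes, the id comes out of `open_fraud_cases` and the tap shuts off—which is the part the spec, as written, never requires.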

Or, to boil it down to its essentials, try this: Neither IdPs nor SPs may sell, share, or use identifiable data, nor may they attempt reidentification of deidentified data, for any reason except as explicitly stated in (what had better be a new and very greatly improved) RA21. I mean, I suspect this to be a pipe dream; NISO hasn’t the guts to tell assessment and learning analytics to take a long walk off a short pier, and the Big Pigs clearly see assessment and learning analytics as an easy lockpick for the identified-data store. But I can dream, can’t I?

Line 2, column 2 expands on this theme: “In addition to the attributes provided by academic libraries, an additional attribute may be sent to the SP to support specific granular usage analysis or charge back requirements that the IdPs may have.” What does “granular” mean exactly? Who in the IdP may initiate such a request, for what reasons? What else can the SP do with this attribute? Who knows? RA21 doesn’t say. It damned well should.

I do approve of the last line on page 15: “In the case that the IdP sends more attributes than the minimal request from the SP, the SP must not collect or store that data under any circumstance.” Keep that. It’s not enough without a definition of “minimal request,” though—a Big Pig could perfectly well ask for every available attribute and call that minimal.
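What would a testable definition of “minimal request” even look like? One option—my sketch, not the spec’s—is to make the SP declare a purpose, map each purpose to a fixed, auditable attribute set, and reject anything beyond it rather than quietly releasing it. Purposes and sets below are invented for illustration:

```python
# Hypothetical "minimal request" check: each declared purpose has a
# fixed ceiling of attributes, so "give us everything" can never
# qualify as minimal.

MINIMAL_SETS = {
    "authorization": {"eduPersonScopedAffiliation"},
    "fraud-investigation": {"eduPersonScopedAffiliation",
                            "eduPersonPrincipalName"},
}

def is_minimal(purpose, requested):
    """True only if every requested attribute fits the purpose's ceiling."""
    return requested <= MINIMAL_SETS.get(purpose, set())
```

Without something like this, “minimal” means whatever the Big Pig asking says it means.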

The rest of the spec is devoted to questions of UX. There isn’t a single additional word about privacy or security that I saw. So there we are. Threat model validated!

NISO, spec authors, the RA21 spec should not be approved, much less implemented, without vastly better data-governance protection for patrons against insider threat from IdPs and SPs than it currently contains. Fix it, please.