Authentication and the privacy of e-resource use

I haven’t been a library sysadmin or an e-resource librarian—missed my chance right out of library school, though I don’t precisely regret it as the library in question was somewhat troubled at the time (it has recovered admirably since). So if I goof up anything here, which I am likely to do, correct me on Mastodon. Still, I am definitely seeing a need for a How E-Resource Authentication Works explainer aimed at people who, like me, don’t worry about authentication for a living.

So, here’s the basic problem. Dr. Scientist wants to get access to some e-resources behind a paywall. The paywall needs to know if Dr. Scientist is allowed to do that. Dr. Scientist’s reading habits are properly none of anyone’s business, as long as the paywall has gotten its money.

[Image: Dr. Scientist saying "Let me through!" to the paywall, which answers "Why should I?"]

One way this can work is if Dr. Scientist establishes an account directly with the paywall, paying whatever the paywall demands—and giving the paywall plenty of personal information about her at signup, no doubt. Each time Dr. Scientist logs in with the paywall, then, the paywall can observe her search and download behavior—even her actual reading behavior if it presents articles in web-buggable HTML rather than PDF or whatever that horrible “enhanced PDF” garbage is that some paywalls are trying to shove on us. The paywall can also load its HTML pages with all the third-party trackers it wants. Dr. Scientist’s privacy from that paywall? Essentially nil.

It’s no coincidence that a number of paywalls have been pushing this model at developed-world Dr. Scientists pretty hard of late. “Bypass your tedious old library and get direct access!” Running roughshod over reader privacy represents a brand-new revenue stream for paywalls, and paywalls, unlike academic libraries, have made no public commitments to protect that privacy.

The reason it isn’t typically a revenue stream already (pace the presence of web trackers on publisher websites) is that the vastly more common access model has the library in the middle. The library pays the paywall, and the library makes sure that its patrons can get through the paywall to the stuff they want.

[Image: Dr. Scientist tells the library "Make the paywall let me through! You know me; I'm Dr. Scientist."]

In this scenario, the communication between Dr. Scientist, library, and paywall can take place in a surprising number of ways. Here are a few.

IP address-range recognition

If Dr. Scientist is on campus, her computer will have an IP address assigned by campus IT. The library can send a list of campus’s IP ranges (that is, clusters of IP addresses) to the paywall. When the paywall sees Dr. Scientist’s request, it notices that the IP address belongs to campus and allows Dr. Scientist through without further authentication. The paywall doesn’t know who Dr. Scientist is at this point without doing extra work! It just knows that she comes from a campus that has paid the paywall.
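For the code-minded, here is roughly what the paywall's check might look like. This is a minimal sketch, not any real paywall's code: the ranges are made-up documentation addresses, and the names CAMPUS_RANGES and is_on_campus are my own inventions for illustration.

```python
import ipaddress

# Hypothetical campus ranges. These are reserved documentation
# addresses (RFC 5737), not any real campus's network.
CAMPUS_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_on_campus(client_ip: str) -> bool:
    """Return True if the requesting address falls inside a campus range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in CAMPUS_RANGES)


# An office machine inside the first range gets through.
print(is_on_campus("192.0.2.17"))
# A coffeeshop address outside every range does not.
print(is_on_campus("203.0.113.5"))
```

Notice that nothing in the check involves a name or a login. All the paywall has to go on is the address.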

Dr. Scientist’s privacy from the paywall under this model: fairly good, though not inviolate. The paywall can track and record what Dr. Scientist does on it (her “session”), and if the session is extensive enough or Dr. Scientist’s interests unusual enough, it might be able to figure out who she is that way. If Dr. Scientist’s office computer has a static (unchanging) IP address—as my work laptop does when it’s connected to my office’s Ethernet cable—her sessions there can be compiled into a fuller dossier on her. If the paywall is connected up with third-party trackers and data brokers, the chances of reidentifying nominally-“anonymous” Dr. Scientist skyrocket.

But Dr. Scientist likes to do her current-awareness reading in the local coffeeshop, whose wireless is not part of any campus IP range. So IP address-range recognition won’t always work to get her past paywalls.

Proxy servers

To fix this problem, academic libraries typically run what’s called a “proxy server,” a bit of software that asks Dr. Scientist to authenticate herself as belonging to campus, and then trots over to the paywall to get Dr. Scientist what she wants to read. Since the proxy server’s IP address is campus-based, the paywall lets it through (although there are other ways this can work, ways I find too abstruse to explain).

Dr. Scientist’s privacy here is tolerable—again, the paywall does not know who she is, and her requests will be mingled with all the proxy server’s other requests on behalf of the rest of campus, which is even better for her privacy—but there is a notable weak spot in it: server logs. Proxy servers keep logs! These logs absolutely do connect Dr. Scientist with exactly what she read.
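A toy model may make the weak spot easier to see. Everything below is hypothetical: ToyPaywall and ToyProxy are invented for this sketch, real authentication (passwords, campus login) is reduced to a username lookup, and no actual proxy product is claimed to work this way internally.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

PROXY_IP = "192.0.2.10"  # hypothetical campus address of the proxy server


@dataclass
class ToyPaywall:
    seen: list = field(default_factory=list)

    def fetch(self, source_ip: str, url: str) -> str:
        # The paywall learns only the requesting address and the URL.
        self.seen.append((source_ip, url))
        return f"contents of {url}"


@dataclass
class ToyProxy:
    paywall: ToyPaywall
    campus_users: set
    log: list = field(default_factory=list)

    def fetch_for(self, username: str, url: str) -> str:
        # Stand-in for real authentication: just a membership check.
        if username not in self.campus_users:
            raise PermissionError(f"{username} is not a campus user")
        # The weak spot: this entry ties a person to what she read.
        self.log.append((datetime.now(timezone.utc), username, url))
        # The request goes out under the proxy's address, not hers.
        return self.paywall.fetch(PROXY_IP, url)


paywall = ToyPaywall()
proxy = ToyProxy(paywall, campus_users={"drscientist"})
proxy.fetch_for("drscientist", "https://paywall.example/article/1")
print(paywall.seen)  # proxy address and URL only
print(proxy.log)     # timestamp, username, and URL
```

The paywall's record holds the proxy's address and nothing about Dr. Scientist. The proxy's own log is where her name sits next to her reading.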

Why do libraries let proxy servers keep logs, if librarians are so fired up about privacy? Excellent question. The answer explains why I personified the library as a besworded guard instead of, say, an usher or a maitre d’. If Dr. Scientist suddenly decides to do a text mining project, starting with an automated download of the entirety of the paywall’s contents from her favorite coffeeshop, the paywall will notice and will lean on the library. It’s then on the library to locate Dr. Scientist and tell her to knock it off, ideally before the paywall slams shut over all of campus. Without the proxy-server logs, the library can’t do that.

(Dear Dr. Scientists and Dr. Digital Humanists everywhere: Don’t. Effing. Start. Text-mining projects. On the literature. Without talking to. THE LIBRARY. First. I have heard the above story so many times from so many librarians at so many campuses about so many Dr. Whoevers in so many different disciplines. Knock it off, you overcredentialed yutzes. Librarians have better things to do than ride herd on you.)

How long are proxy-server logs retained? At entirely too many libraries, entirely too damned long. I have a reliable report of one library, when this question was asked, answering “Oh, they don’t take up much space; we keep them indefinitely.” This is appallingly poor privacy practice and that library should be ashamed of itself. Dump logs routinely, ideally via cron job or similar automation. Personally, I wouldn’t keep proxy-server logs longer than a day unless there was a reason to, but I suppose I could be talked into a week or so. (Any paywall that takes a whole week to lean on the library about a downloading problem deserves what it gets.)
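Here is one way the dumping could be automated, as a sketch under stated assumptions: that the logs are ordinary files in a single directory, that file modification time is a fair stand-in for log age, and that a one-day default suits your library. The function name purge_old_logs and the "*.log" pattern are mine, not anything a particular proxy product ships with.

```python
import time
from pathlib import Path


def purge_old_logs(log_dir: str, max_age_days: float = 1.0,
                   pattern: str = "*.log") -> list:
    """Delete log files older than max_age_days; return the names removed."""
    cutoff = time.time() - max_age_days * 86400  # 86400 seconds per day
    deleted = []
    for path in Path(log_dir).glob(pattern):
        # Age is judged by last-modified time, which is an assumption
        # about how the proxy writes and rotates its logs.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path.name)
    return sorted(deleted)
```

Scheduled to run once a day from cron or a similar tool, something like this keeps retention down to whatever window the library has actually decided on, instead of "indefinitely" by default.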

By the way, any library using proxy-server logs in assessment or learning-analytics projects without deidentification and a whole lot of fuzzing should have its librarians sentenced to walking the halls at a major national library conference wearing sandwich boards proclaiming “WE TRASHED OUR PATRONS’ INTELLECTUAL FREEDOM.” If the proxy-server logs were not only not deidentified, but actively correlated with other sources of campus data about individuals, those sandwich boards should be lined with burlap and weighted with many pounds of lead. Not even slightly okay.

Hm, this post is long enough, I think, and I need to do a little more research before I continue, so I’ll tie it off here for now. Next up: single sign-on and RA21!

(All clip art from Open Clip Art. I am very, very bad at art.)