Kanopy and Elsevier: united in password mishandling?

My introductory information-security course contains both undergraduates and iSchool graduate students. Every once in a while I get to drop in a library- or archives-specific tidbit, and today (the first class meeting after Spring Break), I had two among all the other news:

Shortly after the Kanopy breach broke, Jessamyn West passed on a very important question from Dan Turkel to Kanopy on Twitter: “Are you [Kanopy] storing user passwords in plaintext?”

Let’s back up and examine that question a moment, shall we?

“Plaintext” is information-security jargon for “not encrypted.” “Encrypted,” for our purposes, means “changed such that the original data cannot easily (or ideally at all) be figured out.” So, when Elsevier actually broadcast passwords in plaintext to all and sundry via some web dashboard, it disobeyed one of the fundamental best practices in infosec. If Kanopy was storing its passwords in plaintext, that’s just as bad.

(How do you know if a user’s password is correct, if you can’t store it figure-outably? Well, you know exactly how you changed it. When the user enters their password, you just change it the same way you originally changed the stored password, at which point you can compare the results.)
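The infosec term for that kind of one-way, non-reversible change is “hashing.” To give a sense of how little code the check itself takes, here’s a minimal Python sketch using the standard library’s PBKDF2 routine. The salt size, iteration count, and function names here are purely illustrative, not a recommendation; as I said, ask your favorite infosec expert before building anything real.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """One-way transform ("hash") a password; store only the salt and the digest."""
    if salt is None:
        salt = os.urandom(16)              # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest

def check_password(entered, salt, stored_digest):
    """Re-run the same transform on what the user typed and compare the results."""
    _, candidate = hash_password(entered, salt)
    return hmac.compare_digest(candidate, stored_digest)

# At signup: keep only the salt and digest, never the password itself.
salt, digest = hash_password("correct horse battery staple")

# At login:
print(check_password("correct horse battery staple", salt, digest))  # True
print(check_password("hunter2", salt, digest))                       # False
```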

Nobody is supposed to store passwords in plaintext! Ever! Much less broadcast them in plaintext to all and sundry on a web dashboard! (What you are supposed to do with them is… complicated, and keeps changing as password-cracking software and hardware improves. Check with your favorite infosec expert, okay? And consider multi-factor authentication.) So what Turkel was asking Kanopy boils down to “okay, you were caught being careless; exactly how careless were you?”

Kanopy never answered, at least not on Twitter. This… does not exactly inspire confidence. Nor has Elsevier’s post-incident public relations on Twitter, which as best I can tell has substantially amounted to “it wasn’t that bad!” “everybody else has breaches too!” and similar sad, disingenuous deflections of responsibility. There are best practices in handling security incidents—perhaps unsurprisingly, infosec refers to them by the term “incident response.” These are not them.

I hope to have more to say about incident response in time, because it’s a thing more libraries will find themselves stuck doing—including when our vendors should but don’t—and the first step is always “have a plan for it.”

Authentication and the privacy of e-resource use

I haven’t been a library sysadmin or an e-resource librarian—missed my chance right out of library school, though I don’t precisely regret it as the library in question was somewhat troubled at the time (it has recovered admirably since). So if I goof up anything here, which I am likely to do, correct me on Mastodon. Still, I am definitely seeing a need for a How E-Resource Authentication Works explainer aimed at people who, like me, don’t worry about authentication for a living.

So, here’s the basic problem. Dr. Scientist wants to get access to some e-resources behind a paywall. The paywall needs to know if Dr. Scientist is allowed to do that. Dr. Scientist’s reading habits are properly none of anyone’s business, as long as the paywall has gotten its money.

[Clip art: Dr. Scientist saying "Let me through!" to the paywall, which answers "Why should I?"]

One way this can work is if Dr. Scientist establishes an account directly with the paywall, paying whatever the paywall demands—and giving the paywall plenty of personal information about her at signup, no doubt. Each time Dr. Scientist logs in to the paywall, then, the paywall can observe her search and download behavior—even her actual reading behavior if it presents articles in web-buggable HTML rather than PDF or whatever that horrible “enhanced PDF” garbage is that some paywalls are trying to shove on us. The paywall can also load its HTML pages with all the third-party trackers it wants. Dr. Scientist’s privacy from that paywall? Essentially nil.

It’s no coincidence that a number of paywalls have been pushing this model at developed-world Dr. Scientists pretty hard of late. “Bypass your tedious old library—get direct access!” Running roughshod over reader privacy represents a brand-new revenue stream for paywalls—and paywalls, unlike academic libraries, have made no public commitments not to do so.

The reason it isn’t typically a revenue stream already (the web trackers already present on publisher websites notwithstanding) is that the vastly more common access model has the library in the middle. The library pays the paywall, and the library makes sure that its patrons can get through the paywall to the stuff they want.

[Clip art: Dr. Scientist tells the library "Make the paywall let me through! You know me; I'm Dr. Scientist."]

In this scenario, the communication between Dr. Scientist, library, and paywall can take place in a surprising number of ways. Here are a few.

IP address-range recognition

If Dr. Scientist is on campus, her computer will have an IP address assigned by campus IT. The library can send a list of campus’s IP ranges (that is, clusters of IP addresses) to the paywall. When the paywall sees Dr. Scientist’s request, it notices that the IP address belongs to campus and allows Dr. Scientist through without further authentication. The paywall doesn’t know who Dr. Scientist is at this point without doing extra work! It just knows that she comes from a campus that has paid the paywall.
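To show how mechanical that check is on the paywall’s end, here’s a tiny Python sketch using the standard library’s ipaddress module. The ranges below are documentation-reserved addresses, not anybody’s real campus.

```python
import ipaddress

# Hypothetical ranges the library has registered with the paywall
# (documentation-reserved addresses, not a real campus).
CAMPUS_RANGES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def looks_like_campus(client_ip):
    """True if the requesting address falls inside a registered campus range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in CAMPUS_RANGES)

print(looks_like_campus("198.51.100.42"))  # True: on campus, waved through
print(looks_like_campus("192.0.2.7"))      # False: coffeeshop wifi, no dice
```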

Dr. Scientist’s privacy from the paywall under this model: fairly good, though not inviolate. The paywall can track and record what Dr. Scientist does on it (her “session”), and if the session is extensive enough or Dr. Scientist’s interests unusual enough, it might be able to figure out who she is that way. If Dr. Scientist’s office computer has a static (unchanging) IP address—as my work laptop does when it’s connected to my office’s Ethernet cable—her sessions there can be compiled into a fuller dossier on her. If the paywall is connected up with third-party trackers and data brokers, the chances of reidentifying nominally-“anonymous” Dr. Scientist skyrocket.

But Dr. Scientist likes to do her current-awareness reading in the local coffeeshop, whose wireless is not part of any campus IP range. So IP address-range recognition won’t always work to get her past paywalls.

Proxy servers

To fix this problem, academic libraries typically run what’s called a “proxy server,” a bit of software that asks Dr. Scientist to authenticate herself as belonging to campus, and then trots over to the paywall to get Dr. Scientist what she wants to read. Since the proxy server’s IP address is campus-based, the paywall lets it through (although there are other ways this can work, ways I find too abstruse to explain).
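Conceptually, and only conceptually (real proxy software such as EZproxy has many more moving parts), the proxy’s job looks something like the Python sketch below. The campus_credentials_check function is a placeholder for whatever campus authentication actually happens.

```python
import urllib.request

def campus_credentials_check(username, password):
    """Placeholder: in real life this asks the campus directory or SSO system."""
    raise NotImplementedError

def proxy_fetch(username, password, paywalled_url):
    """Authenticate the patron locally, then fetch the article on her behalf.

    The outbound request to the paywall comes from the proxy's own
    campus IP address, so the paywall sees "campus," not Dr. Scientist.
    """
    if not campus_credentials_check(username, password):
        raise PermissionError("not a member of campus")
    with urllib.request.urlopen(paywalled_url) as response:
        return response.read()
```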

Dr. Scientist’s privacy here is tolerable—again, the paywall does not know who she is, and her requests will be mingled with all the proxy server’s other requests on behalf of the rest of campus, which is even better for her privacy—but there is a notable weak spot in it: server logs. Proxy servers keep logs! These logs absolutely do connect Dr. Scientist with exactly what she read.

Why do libraries let proxy servers keep logs, if librarians are so fired up about privacy? Excellent question. The answer explains why I personified the library as a besworded guard instead of, say, an usher or a maitre d’. If Dr. Scientist suddenly decides to do a text mining project, starting with an automated download of the entirety of the paywall’s contents from her favorite coffeeshop, the paywall will notice and will lean on the library. It’s then on the library to locate Dr. Scientist and tell her to knock it off, ideally before the paywall slams shut over all of campus. Without the proxy-server logs, the library can’t do that.
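For the curious, the kind of thing a library might run against those logs to find the offender can be as simple as the Python sketch below. The log format and the threshold are entirely made up for illustration; real proxy logs vary.

```python
from collections import Counter

# Hypothetical log lines: timestamp, username, URL requested.
LOG_LINES = [
    "2019-04-01T09:00:01 drscientist https://paywall.example/article/1",
    "2019-04-01T09:00:02 drscientist https://paywall.example/article/2",
    "2019-04-01T09:00:03 someoneelse https://paywall.example/article/9",
    # ...imagine thousands more...
]

THRESHOLD = 500  # requests per log window; arbitrary for illustration

def heavy_downloaders(log_lines, threshold=THRESHOLD):
    """Count requests per user and flag anyone over the threshold."""
    counts = Counter(line.split()[1] for line in log_lines)
    return sorted(user for user, n in counts.items() if n > threshold)

print(heavy_downloaders(LOG_LINES))  # [] with the toy data above
```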

(Dear Dr. Scientists and Dr. Digital Humanists everywhere: Don’t. Effing. Start. Text-mining projects. On the literature. Without talking to. THE LIBRARY. First. I have heard the above story so many times from so many librarians at so many campuses about so many Dr. Whoevers in so many different disciplines. Knock it off, you overcredentialed yutzes. Librarians have better things to do than ride herd on you.)

How long are proxy-server logs retained? At entirely too many libraries, entirely too damned long. I have a reliable report of one library, when this question was asked, answering “Oh, they don’t take up much space; we keep them indefinitely.” This is appallingly poor privacy practice and that library should be ashamed of itself. Dump logs routinely, ideally via cron job or similar automation. Personally, I wouldn’t keep proxy-server logs longer than a day unless there was a reason to, but I suppose I could be talked into a week or so. (Any paywall that takes a whole week to lean on the library about a downloading problem deserves what it gets.)
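“Dump logs routinely” can be as simple as a short script run nightly from cron. Here’s a Python sketch; the log directory and retention window are placeholders, so adjust to your own setup.

```python
#!/usr/bin/env python3
"""Delete proxy-server logs older than RETENTION_DAYS.

Run nightly from cron, e.g.:  0 3 * * *  /usr/local/bin/purge_proxy_logs.py
The path and retention below are placeholders, not a recommendation.
"""
import time
from pathlib import Path

LOG_DIR = Path("/var/log/ezproxy")   # wherever your proxy writes its logs
RETENTION_DAYS = 1

cutoff = time.time() - RETENTION_DAYS * 24 * 60 * 60
for logfile in LOG_DIR.glob("*.log"):
    if logfile.stat().st_mtime < cutoff:
        logfile.unlink()               # gone, as it should be
```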

By the way, any library using proxy-server logs in assessment or learning-analytics projects without deidentification and a whole lot of fuzzing should have its librarians sentenced to walking the halls at a major national library conference wearing sandwich boards proclaiming “WE TRASHED OUR PATRONS’ INTELLECTUAL FREEDOM.” If the proxy-server logs were not only not deidentified, but actively correlated with other sources of campus data about individuals, those sandwich boards should be lined with burlap and weighted with many pounds of lead. Not even slightly okay.

Hm, this post is long enough, I think, and I need to do a little more research before I continue, so I’ll tie it off here for now. Next up: single sign-on and RA21!

(All clip art from Open Clip Art. I am very, very bad at art.)

Welcome to Tattle Tape!

It’s not pretty (I have a lot of CSS work to do), and it’s not really ready for prime time, but here we go anyway. Hi, I’m Dorothea Salo, and welcome to Tattle Tape.

Tattle Tape is a blog that will (mostly) be about patron and staff privacy in library contexts. (I’m a noisy person with a lot of professional interests. I know myself pretty well. I will blog about other things now and then. Might as well admit that to myself and everyone else.) It’s named after magnetic strips placed in library books that set off security gates, which tattle on possible thieves of library materials.

A few things are worth setting out at the, as it were, outset, by way of expectations management:

  • Tattle Tape is my blog. I’m not blogging at anybody else’s behest. What you read here is on me and nobody else—especially not any employer I have ever had or any project with which I have ever been associated.
  • Comments are off on most posts. I may occasionally open them. I can be found in the fediverse (Mastodon etc.) at @dsalo@digipres.club for those who are so inclined.
  • I am not nice. I have never been nice. I never will be nice. Nice does not appear among my goals for Tattle Tape. In particular, I have every intention of naming names with respect to publicly-perceptible privacy issues in libraries. If you published it, disseminated it publicly online (as with presentation slides, whitepapers, or whathaveyou), or I can find it out from your website by something as simple as clicking on uBlock Origin or Privacy Badger, it is fair game for Tattle Tape. (That said, I’m not a pentester or webapp bughunter, and if I were, I certainly wouldn’t do it for free. Your secrets are safe from me.)
  • Speaking of which: Part of the CSS work I need to do is getting rid of Google Fonts use in this blog theme. (I will do my best to get that taken care of this week.) Beyond that, other than ordinary Apache logging which I pay zero attention to and delete regularly, there is no logging or other tracking on Tattle Tape (nor the site it is part of), nor will there ever be unless someone hacks the site. I’m a librarian. My library school learned me real good about privacy. I don’t sell, much less donate, my readers to Big Data.

I can’t promise much by way of posting schedule. I’ll spare you the boring litany of busy, but yes, I’m overcommitted and barely staying afloat just now, and that doesn’t look like easing up real soon.

I know some useful things, though, and when I can make time, I’ll share them. Welcome to Tattle Tape.