Authentication and the privacy of e-resource use

I haven’t been a library sysadmin or an e-resource librarian—missed my chance right out of library school, though I don’t precisely regret it as the library in question was somewhat troubled at the time (it has recovered admirably since). So if I goof up anything here, which I am likely to do, correct me on Mastodon. Still, I am definitely seeing a need for a How E-Resource Authentication Works explainer aimed at people who, like me, don’t worry about authentication for a living.

So, here’s the basic problem. Dr. Scientist wants to get access to some e-resources behind a paywall. The paywall needs to know if Dr. Scientist is allowed to do that. Dr. Scientist’s reading habits are properly none of anyone’s business, as long as the paywall has gotten its money.

Dr. Scientist saying "Let me through!" to the paywall, which answers "Why should I?"

One way this can work is if Dr. Scientist establishes an account directly with the paywall, paying whatever the paywall demands—and giving the paywall plenty of personal information about her at signup, no doubt. Each time Dr. Scientist logs in with the paywall, then, the paywall can observe her search and download behavior—even her actual reading behavior if it presents articles in web-buggable HTML rather than PDF or whatever that horrible “enhanced PDF” garbage is that some paywalls are trying to shove on us. The paywall can also load its HTML pages with all the third-party trackers it wants. Dr. Scientist’s privacy from that paywall? Essentially nil.

It’s no coincidence that a number of paywalls have been pushing this model at developed-world Dr. Scientists pretty hard of late. “Bypass your tedious old library; get direct access!” Running roughshod over reader privacy represents a brand-new revenue stream for paywalls, and paywalls, unlike academic libraries, have made no public commitments not to.

The reason it isn’t typically a revenue stream already (pace the presence of web trackers on publisher websites) is that the vastly more common access model has the library in the middle. The library pays the paywall, and the library makes sure that its patrons can get through the paywall to the stuff they want.

Dr. Scientist tells the library "Make the paywall let me through! You know me; I'm Dr. Scientist."

In this scenario, the communication between Dr. Scientist, library, and paywall can take place in a surprising number of ways. Here are a few.

IP address-range recognition

If Dr. Scientist is on campus, her computer will have an IP address assigned by campus IT. The library can send a list of campus’s IP ranges (that is, clusters of IP addresses) to the paywall. When the paywall sees Dr. Scientist’s request, it notices that the IP address belongs to campus and allows Dr. Scientist through without further authentication. The paywall doesn’t know who Dr. Scientist is at this point without doing extra work! It just knows that she comes from a campus that has paid the paywall.
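For the code-minded, the check the paywall runs can be sketched in a few lines of Python. This is a minimal sketch, not any vendor's actual implementation: the `CAMPUS_RANGES` list and the `is_on_campus` function are names I made up, and the addresses are reserved documentation ranges rather than any real campus's.

```python
import ipaddress

# Hypothetical ranges the library might send to the paywall.
# These are reserved documentation ranges, not real campus addresses.
CAMPUS_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_on_campus(client_ip: str) -> bool:
    """Return True if the requesting address falls inside any licensed range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in CAMPUS_RANGES)


print(is_on_campus("192.0.2.77"))    # an office machine inside a campus range
print(is_on_campus("203.0.113.9"))   # an address outside every campus range
```

Notice that nothing in the check involves a name or a login. All the paywall learns at this step is whether the address is on the list.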

Dr. Scientist’s privacy from the paywall under this model: fairly good, though not inviolate. The paywall can track and record what Dr. Scientist does on it (her “session”), and if the session is extensive enough or Dr. Scientist’s interests unusual enough, it might be able to figure out who she is that way. If Dr. Scientist’s office computer has a static (unchanging) IP address—as my work laptop does when it’s connected to my office’s Ethernet cable—her sessions there can be compiled into a fuller dossier on her. If the paywall is connected up with third-party trackers and data brokers, the chances of reidentifying nominally-“anonymous” Dr. Scientist skyrocket.

But Dr. Scientist likes to do her current-awareness reading in the local coffeeshop, whose wireless is not part of any campus IP range. So IP address-range recognition won’t always work to get her past paywalls.

Proxy servers

To fix this problem, academic libraries typically run what’s called a “proxy server,” a bit of software that asks Dr. Scientist to authenticate herself as belonging to campus, and then trots over to the paywall to get Dr. Scientist what she wants to read. Since the proxy server’s IP address is campus-based, the paywall lets it through (although there are other ways this can work, ways I find too abstruse to explain).

Dr. Scientist’s privacy here is tolerable—again, the paywall does not know who she is, and her requests will be mingled with all the proxy server’s other requests on behalf of the rest of campus, which is even better for her privacy—but there is a notable weak spot in it: server logs. Proxy servers keep logs! These logs absolutely do connect Dr. Scientist with exactly what she read.
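Here is a toy simulation of who learns what in the proxy arrangement. It is a sketch under loud assumptions: the `Paywall` and `Proxy` classes, the `PROXY_IP` address, and the pretend campus login are all invented for illustration, and no real proxy product works exactly like this. The point is only to contrast what the paywall sees with what lands in the proxy's log.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

PROXY_IP = "192.0.2.10"  # hypothetical campus address of the proxy server
CAMPUS_USERS = {"dr.scientist": "correct horse"}  # stand-in for campus login


@dataclass
class Paywall:
    """Records what the paywall gets to see: an IP address and a URL, no name."""
    seen: list = field(default_factory=list)

    def fetch(self, source_ip: str, url: str) -> str:
        self.seen.append((source_ip, url))
        return f"contents of {url}"


@dataclass
class Proxy:
    paywall: Paywall
    log: list = field(default_factory=list)

    def request(self, username: str, password: str, url: str) -> str:
        # Step 1: make the reader prove she belongs to campus.
        if username not in CAMPUS_USERS or CAMPUS_USERS[username] != password:
            raise PermissionError("not a campus user")
        # The weak spot: this log line ties a person to a resource.
        timestamp = datetime.now(timezone.utc).isoformat()
        self.log.append((timestamp, username, url))
        # Step 2: fetch on her behalf; the paywall sees only the proxy's address.
        return self.paywall.fetch(PROXY_IP, url)


paywall = Paywall()
proxy = Proxy(paywall)
proxy.request("dr.scientist", "correct horse", "https://paywall.example/article/42")
print(paywall.seen)  # proxy address and URL only
print(proxy.log)     # timestamp, username, and URL
```

The paywall's record holds an address shared by all of campus. The proxy's record holds a name next to a title, which is why the log is the thing to worry about.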

Why do libraries let proxy servers keep logs, if librarians are so fired up about privacy? Excellent question. The answer explains why I personified the library as a besworded guard instead of, say, an usher or a maitre d’. If Dr. Scientist suddenly decides to do a text mining project, starting with an automated download of the entirety of the paywall’s contents from her favorite coffeeshop, the paywall will notice and will lean on the library. It’s then on the library to locate Dr. Scientist and tell her to knock it off, ideally before the paywall slams shut over all of campus. Without the proxy-server logs, the library can’t do that.

(Dear Dr. Scientists and Dr. Digital Humanists everywhere: Don’t. Effing. Start. Text-mining projects. On the literature. Without talking to. THE LIBRARY. First. I have heard the above story so many times from so many librarians at so many campuses about so many Dr. Whoevers in so many different disciplines. Knock it off, you overcredentialed yutzes. Librarians have better things to do than ride herd on you.)

How long are proxy-server logs retained? At entirely too many libraries, entirely too damned long. I have a reliable report of one library, when this question was asked, answering “Oh, they don’t take up much space; we keep them indefinitely.” This is appallingly poor privacy practice and that library should be ashamed of itself. Dump logs routinely, ideally via cron job or similar automation. Personally, I wouldn’t keep proxy-server logs longer than a day unless there was a reason to, but I suppose I could be talked into a week or so. (Any paywall that takes a whole week to lean on the library about a downloading problem deserves what it gets.)
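Dumping logs on a schedule does not take much. Below is a minimal sketch of a purge function, assuming the logs are plain files sitting in one directory and matching a `*.log` pattern. The function name, the pattern, and the one-day default are my assumptions, not the behavior of any particular proxy product, so check how your own server names and rotates its logs before trusting anything like this.

```python
import time
from pathlib import Path


def purge_old_logs(log_dir: str, max_age_days: float = 1.0, pattern: str = "*.log") -> list:
    """Delete files matching `pattern` in `log_dir` that were last modified
    more than `max_age_days` ago. Returns the sorted names of deleted files."""
    cutoff = time.time() - max_age_days * 86400  # 86400 seconds per day
    deleted = []
    for path in Path(log_dir).glob(pattern):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path.name)
    return sorted(deleted)
```

A nightly cron job (or whatever scheduler the server offers) could call this with the real log directory. Passing `max_age_days=7` gives the week I could be talked into.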

By the way, any library using proxy-server logs in assessment or learning-analytics projects without deidentification and a whole lot of fuzzing should have its librarians sentenced to walking the halls at a major national library conference wearing sandwich boards proclaiming “WE TRASHED OUR PATRONS’ INTELLECTUAL FREEDOM.” If the proxy-server logs were not only not deidentified, but actively correlated with other sources of campus data about individuals, those sandwich boards should be lined with burlap and weighted with many pounds of lead. Not even slightly okay.
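To make “deidentification and a whole lot of fuzzing” a little more concrete, here is one possible reading of it, sketched in Python. It assumes log rows shaped like `(timestamp, username, resource)` with ISO 8601 timestamps, which is my invention rather than any real log format. It throws away names and times of day, counts distinct readers per resource per day, and suppresses any count below a threshold. This is an illustration of the idea, not a guarantee of anonymity; real deidentification takes more care than this.

```python
from collections import defaultdict


def deidentified_counts(log_rows, min_readers=5):
    """Reduce (timestamp, username, resource) rows to per-day reader counts.

    Usernames and times of day never leave this function, and any
    (day, resource) pair with fewer than `min_readers` distinct readers
    is dropped so that rare reading stays hidden."""
    readers = defaultdict(set)
    for timestamp, username, resource in log_rows:
        day = timestamp[:10]  # keep only YYYY-MM-DD from an ISO 8601 string
        readers[(day, resource)].add(username)
    return {
        key: len(names)
        for key, names in readers.items()
        if len(names) >= min_readers
    }
```

Counting distinct readers rather than raw requests matters: one person hammering an obscure journal all afternoon should not clear the threshold by herself.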

Hm, this post is long enough, I think, and I need to do a little more research before I continue, so I’ll tie it off here for now. Next up: single-sign-on and RA21!

(All clip art from Open Clip Art. I am very, very bad at art.)

Breaking the panopticon: who’s watching library patrons, and can we stop them?

Teaching from the real world is pure joy most of the time. Students love it when they see something from class in the pixels of library journals and magazines, the mass media, or the technology press. Most of the time, discussing change while it’s happening is a visceral lesson in professional adaptability and continuous learning. I could have done without having to teach technology-related privacy issues to my “Digital Trends, Tools, and Debates” students in the shadow of the NSA’s newly-revealed surveillance practices, however.

Those who watch my Twitter feed have lately endured many 140-character howls of helpless dismay as I read the tech press in the late afternoons. Leaving that anger aside as I wrote and recorded lectures nearly broke me. Boiling immensely complex facts based on technologies no less complex into a snappy and comprehensible lecture is hard enough, but it’s a challenge I’m well used to; disciplining myself to avoid bursting into spittle-flecked rants was the hard part.

As I always do, I explained to my students why I chose to teach them about this. My own visceral outrage aside, the simplest reasons call back to parts of the ALA Code of Ethics:

II. We uphold the principles of intellectual freedom and resist all efforts to censor library resources.

III. We protect each library user’s right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.

V. We treat co-workers and other colleagues with respect, fairness, and good faith, and advocate conditions of employment that safeguard the rights and welfare of all employees of our institutions.

VI. We do not advance private interests at the expense of library users, colleagues, or our employing institutions.

What price intellectual freedom and freedom to read, never mind privacy and confidentiality, when the NSA has built weaknesses into security standards and frameworks that could help other snoops grab every byte passing through a library computer, or over the library wireless network? When Amazon tracks library checkouts to Kindle devices, creepily attaching buy-this-book come-ons to due-date notices? When any number of commercial data warehouses track patron information behavior on the computers and wifi networks libraries provide?

The Internet in general and the web in particular have become Jeremy Bentham’s panopticon. That panopticon unquestionably surveils us and our patrons. If libraries are truly to be the privacy-protecting, commercial-free civic spaces they aim to be, shouldn’t we librarians extend the principles of the ALA Code of Ethics to digital environments as well? What would that take?

The scope of the problem

This isn’t only about circling library wagons against the NSA. Who surveils library staff computers? In many K-12 environments, the answer is obvious and in some ways troublesome. To my surprise, however, schools do not harbor the only library environment where employer surveillance may appear. When I asked, several librarians in academic and public libraries privately voiced suspicions to me that either library IT or the IT establishment in the library’s parent organization was logging behavior on work computers. Even more troublesome: they did not know what was and wasn’t logged, had no available policy on the question, and could not find out more. I don’t find this uncertainty indicative of what the Code of Ethics terms “respect, fairness, and good faith.”

My entirely unscientific and not-to-be-relied-upon information gathering for this column suggested that surveillance may be commoner when library IT is not controlled by the library. This makes intuitive sense. Not only do many corporate, government, and academic IT centers not share library ethics, they operate under different constraints and directives. A library, for example, can push back against overreaching copyright enforcement directives; we understand fair use and consider fair-use advocacy part of our mission. When the RIAA, a major serials publisher or aggregator, or similar copyright-owner interests lean on IT, however, IT has little choice but to make the problem go away with minimal hassle and minimal legal risk to the larger institution. This is liable to mean surveillance (in the form of log monitoring at minimum) and no-longer-neutral web access.

As for warding off surveillance from private interests, I’ve been teaching my Digital Trends students about the commercial web-tracking establishment and available techniques to defeat it for years. When I asked Twitter and FriendFeed whether any libraries had defended against this surveillance by adding anti-tracking plugins to the web browsers in stock patron or staff computer configurations, however, I came up completely empty. I found that both unexpected and troubling. I would dearly love comments here from librarians who have considered this issue and implemented privacy-protecting measures in their libraries!

Ignorance is part of the problem, certainly. My own wake-up call came a couple of weeks ago, when I interviewed Brendan O’Connor, a student in the UW-Madison School of Law, about the cheap, Tarot-deck-sized wifi surveillance box he calls the “F-BOMB” along with its monitoring system CreepyDOL, built as a proof-of-concept assessment of the privacy threats involved in much normal everyday network use. Before talking to Brendan, I hadn’t any notion how much data wifi-enabled devices such as laptops, tablets, and smartphones regularly and unstoppably leak, nor how oblivious to personal-data leakage many websites (including librarian favorites such as newsfeed-readers) are. Supposedly I teach technology! If there’s this much I don’t know, when I make constant and regular effort to keep up with technology-related privacy issues, I can’t help but be concerned about the level of awareness in librarianship generally. How can we decide what to do about a phenomenon we don’t understand?

What to do?

That we as a profession have a duty to advocate with legislators and technology providers for better privacy protection in communication protocols, on websites, and in mobile platforms seems beyond question. Frustrated with the stalemate he perceives in the technology establishment around personal privacy, Brendan O’Connor suggested to me that privacy protection could be usefully framed as a consumer-safety issue. I think that a promising approach, but I see no reason standard library ethical stances around personal privacy as an inescapable component of intellectual freedom and citizenship cannot make themselves heard as well. Available fixes are highly technical, of course, but the needed advocacy to force the technology establishment into making those fixes relies on exactly the sort of ethical suasion that libraries and their professional organizations excel at.

What immediate technical fixes could libraries implement? When I brought up browsing privacy on FriendFeed, librarian Aaron Tay of the National University of Singapore wondered whether I was advocating that all libraries place their computers on the (possibly NSA-compromised, but still best-of-breed) Tor anonymizer network. I’ve used Tor now and then, so I know it stresses bandwidth and degrades the apparent responsiveness of web browsing somewhat; I don’t doubt many of our patrons would find this an unacceptable tradeoff. Stephen Francoeur of Baruch College noted that anti-tracking browser plugins, if poorly chosen or poorly configured, could block cookies that some websites require in order to function properly. Both critiques have merit.

To my mind, libraries can consider a continuum of responses, with universal Tor implementation, perhaps allied with a draconian JavaScript-killer like NoScript that is known to break many websites, on the extreme (doubtless infeasible) end. On the other end of the continuum lies pure education: block nothing, explain everything. The website “Terms of Service; Didn’t Read,” which grades the quality of the privacy policies at many commonly-used websites, offers a plugin for many popular browsers (Internet Explorer excluded, unfortunately) that puts its grades right in the browser interface for perusal. Some anti-tracking plugins, Ghostery for example, can be configured not to block, but to display information about which trackers are active during a web browsing session. I encourage everyone who works in libraries to investigate and test these plugins, at home if not at work! Let us share what we learn, so that librarianship as a whole starts to frame a digital-privacy strategy.

Where is the middle course? The “Do Not Track” browser preference, lackadaisical though support for it is, is worth triggering by default just as a statement of intent. Anti-tracking plugins are well worth considering for library staff and patron machines also. I’ve been using them for some years, and hardly ever notice browsing problems. On the rare occasion a site does break, the fix is generally a two-click temporary disabling of the anti-tracking plugin for that site, something I hope could be relatively easily taught to reference and tech-support staff. Wifi security is still rather weak, and implementing it unquestionably creates tech-support issues, but with a heavy heart I confess that secured wifi now seems preferable to me over open access points.

As for surveillance closer to home, at minimum libraries owe their staff transparent policy and procedure. Even libraries with no choice but to surveil staff, as in many schools, should be straightforward about what is happening. Even libraries that don’t control their own computers can challenge IT to be transparent and to protect privacy whenever possible. We can at least avoid turning into mini-NSAs, hiding snooping behind silence and obfuscation!

It is true that some anti-tracking technologies create browsing hassles. It’s also true that institutions we favor and rely upon, such as news media, themselves rely on tracking to improve their balance sheets as they move online. Finally, it’s true that some digital invasions of privacy are well beyond our control. As I thought about all this, though, I found myself repeating “not in libraries, not here” over and over again under my breath. The NSA may, legally or no, track the web traffic of foreign nationals, catching many American citizens in the backwash, but not here. Advertisers may compile behavior portfolios for promiscuous sale, but not here. Social media may track their users across the entire web, but not here. Digital panopticons may spring up like weeds, but not here. Not here. Here, in libraries, privacy should be the default.

I am grateful to Myron Groover, the Library Society of the World, and Twitter correspondents who wish not to be named for giving me examples of library-computer surveillance and helping me shape my thinking. I am not affiliated in any way with the websites or browser plugins mentioned herein, except as user and classroom demonstrator.

Note: This post is copyright 2013 by Library Journal. Reposted under the terms of my author’s agreement, which permits me to “reuse the work in whole or in part in your own professional activities and subsequent writings.”