Triumph, RomAn-21-style

It is pretty nice to see librarians step up. That’s what just happened with our good buddy RA21. The chariot awaits, everybody. Climb in!

But I’ll be the whisperer behind the chariot: This probably isn’t over. I mean, RA21 will certainly go back to the drawing board, and a fittingly-embarrassed NISO will be at least a tiny bit more careful not to be so obviously vendor-captured next time…

… but the STM Association is a hard nut to crack. I’ve tangled with them before, and watched others tangle with them as well. They’re persistent, they don’t subscribe to library ethics, and they use every dirty trick in the book and a few outside it.

I’m less worried than I was, since my chief fear was that the STM Association with NISO as convenient clueless patsy would sneak RA21 past libraries unopposed. I’m now confident that won’t happen. What will happen is hard to guess—the standards space is labyrinthine and the STM Association knows it pretty well, so finding or even making another, quieter patsy is hardly off the table. We’ll just have to wait and see.

Physical-equivalent privacy

One of the confounders thrown at privacy advocates inside and outside libraries is “privacy is too nebulous a concept to operationalize!” I’m sympathetic to a point, because the philosophers and the ethicists and the lawyers and the coders have made rather a tangled mess of things, but frankly in my head I call this confounder an “ethics smell” (after programmers’ “code smell,” which is a hint somewhere in otherwise-working code that something in the program’s design or construction is very, very wrong). It’s a cheap excuse anyone can use to pretend privacy doesn’t actually matter and shouldn’t be a consideration. We who value privacy are still stuck answering this smelly objection, though, so how do we usefully do that?

In some library conference talks I’ve done, I’ve groped toward a formulation I’m now calling “physical-equivalent privacy.” That is, if we wouldn’t track a print book, or a person using the physical library, in a particular way, the digital analogue to that tracking behavior is also not okay. Put more formally, “the library patron using library-provided electronic information should enjoy privacy protection equal to that of the same patron using the same information via a library-provided physical information carrier.” This is not a perfect analogy, let me just state that up-front—physical surveillance is also ramping up in all too many contexts, even in libraries—but it productively tickles most folks’ sense of what’s creepy, and I think it also activates a lot of tacit operational-privacy knowledge in librarianship.

Let’s walk through an example: usage counting for spaces. Physical first.

  • Gate counts, or a count clicker? Not terribly creepy.
  • Counts by a person walking through the library? A little creepy, but the creepy factor can be managed through a clear explanation (“we’re only counting, not recording anything else about anybody”).
  • Counts that record possibly-visually-assessable demographic information (let’s say gender, race/ethnicity, and socioeconomic status) about patrons? Decidedly creepy, especially for patrons of already-oversurveilled populations, as well as quite prone to inaccuracy—one wouldn’t want to predicate services on such an assessment if one has the common sense of an oyster. Nor would one care to explain this practice to patrons, because not a few patrons (self included!) would flee the library for good.
  • Counts by a person who asks each patron about their demographic information? Highly, highly creepy—we’ve definitely arrived at “scary” now, if we hadn’t already.
  • Counts by a person who asks each person their name or library-card barcode number? So scary that the library would likely empty out.
  • Combining either of the last two with recording the library material that the patron is reading, watching, or listening to? Absolutely beyond the pale; I would expect huge protests from patrons, working professionals, and professional orgs.
  • Combining demographic information, name or other identifier, and materials choice? Ugh, just forget it—that’s utterly beyond scary into full Orwell.

You may have noticed that I left video surveillance out of the above list. You’re right; I did, and consciously so. Some people feel safer in the presence of video surveillance. I think they’re very often wrong to! But that still means I can’t usefully invoke video surveillance to tickle folks’ creepy meter, because it won’t always work as I’d intend it to. Audio surveillance might be an effective creepy-meter tickler, though, and it is salient in the context of voice-activated mobile phones and other devices.

Your sense of creepitude may differ from mine, but if you’re a librarian, I’d guess we’re fairly close together. (Caveats: some librarians don’t understand or believe that children also need and deserve information privacy, and some librarians have been utterly seduced by surveillance capitalism and its cousins “analytics,” “library value,” and “personalization.”) So let’s take that sense of creepy and apply it to digital analogues.

  • Gate counts? Analogous to unadorned usage counts—and I do mean “unadorned,” no IP addresses, no login information, minimal to no geolocation, exact time fuzzed at least to hour. Not particularly creepy in and of itself, but I might argue that the persistence of this information, and lack of disclosure about where else it will go and how it will be used, starts to get just a little creepy. (There’s a small sketch of what “unadorned” looks like in practice right after this list.)
  • Counts by a person walking through the library? Hm. Let’s call it a person looking at the above unadorned usage counts. Not creepy—but also not visible to patrons in a way that begs to be explained to them. One important factor in creepiness is “doing something to/about me without my knowledge, much less agreement.” This fuels the often-researched observation that people get less comfortable with digital surveillance the more they find out about it.
  • Counts that record demographic information? In academic libraries, this can take the form of correlating library and library-materials use with information from the institutional student-records office. A lot of learning analytics projects have done this, not thinking anything of it. It’s certainly more accurate than the impressionistic physical version, but does that really cut the creepy any? As for public libraries, this is what customer-relationship-management tools like OCLC Wise, currently being heavily flogged at conferences, run on. It’s hella creepy in my book, and it’s noticeably less privacy than a patron using the physical library and physical library materials has. Not okay. Not okay!
  • Counts by a person who asks each patron about their demographic information? Same as above, really, except again, the “ask” part is missing. Learning analytics researchers slice and dice the data by demographic regularly, even when the numbers end up so small as to be wholly ungeneralizable—but highly reidentifiable. It’s creepy. It’s wrong. When the end goal is “personalization”—that is, a group of few or even one!—the creep factor multiplies further.
  • Counts by a person who asks each person their name or library-card barcode number? This one’s easy: many electronic-materials vendors in both public and academic libraries either collect this information directly or make the library collect it (e.g. by proxy server). For library-computer and network use, device identifiers (like mobile phone MAC addresses or IDFAs/ADIDs) come into play. And again, learning-analytics research has extended this identify-the-patron paradigm into information-literacy instruction, computer and network use, interlibrary loan, and other areas where it absolutely doesn’t belong. Creepy as all get-out.
  • Combining either of the last two with recording the library material that the patron is reading, watching, or listening to? Learning-analytics research, again. Have I gotten across how creepy that stuff is yet? Have I? I mean, go read it if you don’t believe me.
  • Combining demographic information, name or other identifier, and materials choice? Even learning-analytics researchers typically fuzz the materials choice. Not always, though, and some of them seem to wish they didn’t have to.
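
Since I keep leaning on the word “unadorned,” here’s a minimal sketch (Python, with hypothetical field names) of what scrubbing a single usage-log hit down to a gate-count equivalent might look like. The point is less the code than what gets dropped entirely and what gets fuzzed.

```python
from datetime import datetime

def unadorn(raw_hit: dict) -> dict:
    """Reduce a hypothetical usage-log entry to a gate-count equivalent.

    Drops IP address, login, and geolocation entirely; fuzzes the
    timestamp to the hour. What's left supports counting and not much else.
    """
    ts = datetime.fromisoformat(raw_hit["timestamp"])
    return {
        "resource": raw_hit["resource"],   # what was used...
        "hour": ts.replace(minute=0, second=0, microsecond=0).isoformat(),
        "count": 1,                        # ...and that it was used once
        # deliberately absent: ip, username, barcode, user agent, geolocation
    }

# Example: a raw hit with identifying detail becomes a bare count.
raw = {"timestamp": "2019-04-10T14:37:22", "resource": "/db/jstor",
       "ip": "198.51.100.23", "username": "jdoe"}
print(unadorn(raw))
# {'resource': '/db/jstor', 'hour': '2019-04-10T14:00:00', 'count': 1}
```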

See how this works, kinda? I hope so. Another way to work it is starting from a digital-surveillance situation and working out its physical analogue. My favorite example, which I’ve used in my classroom, is the Adobe Digital Editions fiasco from 2014. To get this level of information about use of a checked-out print book, you’d have to follow the patron home and watch them as they read it! Not even slightly okay! So why is it okay when it’s an ebook?

I’m not pretending this or any analogy can be a silver bullet. I do think this could be a valuable rhetorical technique to sway folks on the fence about the latest library-privacy invasion, or to face down the Library Value Agenda at conferences or library-internally. In that latter case, I suggest combining it with a “you-first” strategy, to bring in the problems of power differentials and nonexistent or forced consent: “How about we pilot this by tracking library staff in this fashion? No? Why is it okay to do this to students, then?”

Good luck. I’d be interested to know about folks trying this, and how it turns out for them.

Thinking through RA21

“Recommended Practices for Improved Access to Institutionally-Provided Information Resources,” colloquially known as “RA21,” is open for public comment until the 17th of May. I definitely plan to comment, but this post is not that comment—this post is the plonking, pedantic process of figuring out and refining what I actually think so I can make a comment that is at least slightly more cogent than “ARGH, NO, CAN WE PLEASE NOT” or “Most mendaciously misleading specification title ever, y/n?”

You may have gathered from the above that I am not a fan of this spec. Quite right. I’m not. You know, cards on the table and all that.

One reason I am not a fan is the authorship list (see page 2 of the spec PDF), which is a bunch of big-pig publishers with a sprinkling of IT people and… (as best I can tell) exactly zero librarians, LIS faculty, or library or technology ethicists. (Peter Brantley works in a library but is not a librarian. Usually I wouldn’t care, but in this specific situation, I think it matters. I also think it matters that Brantley is a longtime participant in hypothes.is, a project that has signally failed to get to grips with the staringly obvious truth that a tool that forces comments onto every website ever is a living invitation to harassment and abuse. Not, in my book, trustworthy on privacy or safety—and I can’t help wondering whether this is precisely why he landed on the authorship group.) How am I not supposed to suspect that the big pigs are pulling a fast one on privacy here? Especially given that the big pigs are presently a miasma of surveillance-capitalism FOMO?

(Seriously, this whole thing is giving me flashbacks to back in the day when the NIH held hearings on a possible open-access policy and only invited the big pigs. That was ridiculously regulatory-captured—though fortunately the NIH got over that—and so is this. Does NISO not notice when it’s being used as a patsy and a figleaf?)

So. One obvious threat model here: the big pigs are trying to use RA21 to accumulate saleable (or otherwise exploitable, perhaps via Facebook-style microtargeted advertising) data about individual library patrons’ reading, data they often can’t get now because proxy servers and IP ranges are in the way. Let’s take as read that aggregate use data is either not saleable/exploitable at all, or not nearly as lucrative to sell/exploit as use data where individuals are identified or reidentifiable.

Another threat model that springs to mind: individualized and/or reidentifiable surveillance of patron information use, likely in the name of “learning analytics” or “assessment.” The big pigs could want to do this. Institutions could want to do this. Unholy alliances between big pigs and institutions (e.g. the sordid roiling cauldron of Arizona State plus Cengage, or the blithely horrifying Unizin Consortium) could want to do this.

Quick reminder, academic librarians: ACRL says quite clearly and distinctly that “The privacy of library users is and must be inviolable. Policies should be in place that maintain confidentiality of library borrowing records and of other information relating to personal use of library information and services.” (Another of my long-brewing rants has to do with “confidentiality” as a word librarians doing Seriously Shady Stuff hide behind. Rant for another day, though.)

Um. Possibly I should back up and explain what RA21 even is first? Yes. I should do that.

RA21 and single sign-on (SSO)

So, I explained before how libraries deploy IP range recognition and proxy servers to get Dr. Scientist through publisher paywalls. RA21 wants to replace this constellation of systems—which, to be fair, is decidedly kludgy and maintenance-intensive—with SSO, which is near-ubiquitous (though not universal) higher-ed IT infrastructure. If you work and/or study in higher ed, chances are good that whenever you use your institution’s login credentials, there’s an SSO server authenticating you and at least partially authorizing you to do things with, on, or to institutional systems.

The central ideas fueling SSO aren’t too hard to get one’s head around. Organizations that run an SSO server to authenticate their students and staff are “Identity Providers” or “IdPs.” An IdP knows and can share various facts about the people represented in it; these facts are “attributes,” or taken as a whole “metadata.” Then there are services that need to ask IdPs about people in the organization, often though not solely to authenticate them and authorize service use; these are “Service Providers” or “SPs.” SPs may be internal or external to the organization—that is, a campus system as well as a big-pig publisher could rely on the campus’s SSO server for authentication and authorization. For probably-obvious reasons, though, internal SPs tend to have broader access to more attributes than external SPs. Clear so far? Cool.
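
To make the IdP/SP relationship a little more concrete, here’s a minimal sketch in Python. The organizations, the release policy, and the code itself are invented for illustration; real deployments (Shibboleth, for instance) express this sort of attribute-release policy in configuration rather than application code, but the shape is the same: the IdP knows a lot, and each SP is supposed to see only its allotted slice.

```python
# Hypothetical sketch of SSO attribute release. Names and policies are
# invented; only the eduPerson attribute names are real.

PERSON_ATTRIBUTES = {
    "jdoe": {
        "displayName": "J. Doe",
        "mail": "jdoe@example.edu",
        "eduPersonPrincipalName": "jdoe@example.edu",  # a personal identifier
        "eduPersonAffiliation": ["student"],
        "eduPersonEntitlement": ["urn:example:library:licensed"],
    }
}

# Per-SP release policy: internal SPs see more, external SPs less.
RELEASE_POLICY = {
    "campus-portal.example.edu": {"displayName", "mail", "eduPersonAffiliation"},
    "bigpig-publisher.example.com": {"eduPersonEntitlement"},  # no identifier
}

def attributes_for(sp: str, user: str) -> dict:
    """Return only the attributes this SP is allowed to receive."""
    allowed = RELEASE_POLICY.get(sp, set())
    return {k: v for k, v in PERSON_ATTRIBUTES[user].items() if k in allowed}

print(attributes_for("bigpig-publisher.example.com", "jdoe"))
# {'eduPersonEntitlement': ['urn:example:library:licensed']}
```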

What attributes do SSO servers typically store and share on people? This varies by IdP, and can be shockingly hard to find good information about. (One of the reasons this post has taken so long is precisely the time involved in trying to back-of-the-envelope calculate a baseline for attributes across IdPs. I’m posting now partly because of NISO’s RA21 feedback deadline and partly because I’ve given up on using OSINT to figure out how IdPs handle attributes. It’ll take somebody like Educause running a survey.) Dear campus-IT units everywhere: be transparent about this, would you please?

Brown University has the clearest, completest attribute rundown I’ve yet seen. (Dear campus-IT units everywhere: imitate Brown, please.) A few things to notice here:

  • Several attributes are actual personal identifiers. Others constitute personally-identifiable information. Still more are indirect identifiers—combined, they could pick someone out of the pack even without an actual personal identifier. (Has anyone run a study on how k-anonymous the entire list of someone’s eduPersonEntitlements or isMemberOfs is? I’d love to know.)
  • External SPs—such as our big-pig publishers—can request that Brown hand them at least one personal identifier (eduPersonPrincipalName, better known as “eppn”).
  • The exact circumstances or process under which Brown would grant an external SP access to this identifier are not stated. (I looked but didn’t find anything. I may admittedly have missed something, or the documentation for this may be on a Brown intranet.) If there’s a generic recommended process and data-governance scheme for this, I haven’t found that either.
  • The supposedly-anonymized personal identifier (persistent-id) is unchanging, which allows an external SP to build up enough of a behavior-trail or web-tracker dossier over time to reidentify someone. (John Mark Ockerbloom has already pointed this out to NISO. Yay! That means I don’t have to!)
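
To make that last point concrete, here’s a toy sketch (all data invented) of how an SP that only ever receives an unchanging pseudonymous identifier can still pile up a behavior trail distinctive enough to point at one person.

```python
from collections import defaultdict

# Hypothetical access events as an external SP might log them. The SP never
# sees a name, only the same opaque persistent-id on every visit.
events = [
    ("pid-a81f", "2019-03-01", "Journal of Neurotrauma"),
    ("pid-a81f", "2019-03-04", "Journal of Neurotrauma"),
    ("pid-a81f", "2019-03-19", "Disability Studies Quarterly"),
    ("pid-9c02", "2019-03-02", "Tetrahedron Letters"),
]

# Group events by persistent-id: each group is a slowly growing dossier.
dossiers = defaultdict(list)
for pid, date, journal in events:
    dossiers[pid].append((date, journal))

# Over months, each dossier becomes a distinctive reading profile. Combine it
# with anything external (a lab roster, a conference program, a web tracker)
# and "pseudonymous" quietly stops meaning "anonymous."
for pid, trail in dossiers.items():
    print(pid, trail)
```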

So could my RA21 threat models work?

Going by the RA21 spec, how could big pigs pile up a saleable/exploitable dataset of individuals and their information use? Well, trackers on their websites, of course, but that’s out of scope for RA21 so let’s leave it aside. Could they do it if they con institutions into helping? Perhaps in the name of the Library Value Agenda or similar?

On page 8, the fourth desideratum for RA21 reads: “The system will achieve end-to-end traceability, providing a robust, widely adopted mechanism for detecting fraud that occurs at institutions, vendor systems, and publishing platforms.” Unless the sole imagined fraud prevention method here is cutting off Dr. Scientist’s unpermissioned literature text-mining project without any further attempt to locate or punish her—which I would be okay with, actually—that suggests that on some level and for some time period, patrons’ information use must be traceable back to them by someone. On what level, for what time period, and by whom? And does RA21 allow the information necessary for fraud prevention to be (mis)used for other reasons, say, Big Pigs wanting to sell or exploit it? Who knows? RA21 doesn’t say. It damned well should.

The real meat is in section 2.3 starting on page 14. Line 1 column 2 of the table (minus the headers, you persnickety people) reads “Unless the SP has a specific, contractual agreement with an IdP, the IdP should only send anonymous and pseudonymous identifiers to the SP.” There. There. There is the incredibly giant loophole the big pigs will just waltz right on through. All SPs have to do to snoop individual patrons’ information use is convince IdPs to let them.

We trust IdPs implicitly… why exactly? Dear NISO: It’s called “insider threat.” Look it up. Then add it to your hilariously incomplete security/privacy threat model.

If I had better vidding skills than I do, I’d voiceover a scene from Disney’s Little Mermaid:

Big Pig: Come in. Come in, my librarian. We mustn’t lurk in doorways—it’s rude. One might question your upbringing… Now, then. You’re here because you have a thing for assessment. This, er, learning analytics doohickey. Not that I blame you—it is quite a temptation, isn’t it? Well, awed vocationist, the solution to your problem is simple. The only way to get what you want—is to send me an identifier attribute and ask me to collect the identified data for you.

Librarian: You can do that?!

Big Pig: My dear, sweet librarian. That’s what I do—it’s what I live for. To help underappreciated librarians like yourself. Poor souls with no one else to turn to.

I’ll spare you the actual filk of the song. (If you want to do it, though, go right ahead!)

It gets worse. Proxy servers and IP ranges are generally handled by the library, often with an assist from campus IT. SSO servers, however, are the domain of campus IT, with zero input whatever from the library. Campus IT does not subscribe to library ethics. Campus IT has different pressures on it than the library—most provosts, chancellors, and deans know better than to openly force the library to compromise privacy; the same is not true of campus IT (which often enough doesn’t give a crap about privacy anyway unless forced to). So the Big Pigs can probably have whatever identifiable data they want just by hinting to campus IT or campus administrators that it’d be useful assessment data to have, and wouldn’t you darling people just love to request it from us? All you need to do is send us the identifier attribute that makes it possible!

Not, frankly, that many academic libraries would resist temptation all that much better.

RA21 must put a stop to this scenario if it is to conform to library privacy norms. IdPs must only send identifier attributes in cases of suspected fraud, and when the case is closed, they must stop sending those attributes. SPs must only request identifier attributes in cases of suspected fraud, and when the case is closed, they must cease the request and discard all identified data. Furthermore, IdPs must never grant SPs access to attributes without the library’s explicit approval, and the library may only ever grant such approval for identifier attributes (or indirect identifiers in excess of what is absolutely necessary) in cases of suspected fraud.

Or, to boil it down to its essentials, try this: Neither IdPs nor SPs may sell, share, or use identifiable data, nor may they attempt reidentification of deidentified data, for any reason except as explicitly stated in (what had better be a new and very greatly improved) RA21. I mean, I suspect this to be a pipe dream; NISO hasn’t the guts to tell assessment and learning analytics to take a long walk off a short pier, and the Big Pigs clearly see assessment and learning analytics as an easy lockpick for the identified-data store. But I can dream, can’t I?

Line 2, column 2 expands on this theme: “In addition to the attributes provided by academic libraries, an additional attribute may be sent to the SP to support specific granular usage analysis or charge back requirements that the IdPs may have.” What does “granular” mean exactly? Who in the IdP may initiate such a request, for what reasons? What else can the SP do with this attribute? Who knows? RA21 doesn’t say. It damned well should.

I do approve of the last line on page 15: “In the case that the IdP sends more attributes than the minimal request from the SP, the SP must not collect or store that data under any circumstance.” Keep that. It’s not enough without a definition of “minimal request,” though—a Big Pig could perfectly well ask for every available attribute and call that minimal.

The rest of the spec is devoted to questions of UX. There isn’t a single additional word about privacy or security that I saw. So there we are. Threat model validated!

NISO and spec authors: the RA21 spec should not be approved, much less implemented, without vastly better data-governance protection for patrons against insider threat from IdPs and SPs than it currently contains. Fix it, please.

Teaching adversarial thinking

In case you missed it: A couple months ago a law prof brought on the wrath of academic Twitter by suggesting that students spend a week eavesdropping on the conversations of others to listen for people betraying their own security and privacy, a thing that people quite commonly do. Some of academic Twitter—self included—was initially entranced, until other parts of academic Twitter asked whether casual snoops (or even not-casual snoops) were really an okay thing to turn our students into. Especially when many of our students are still so unaware of the workings of privilege, such that snooping can take on exceptionally sinister overtones when applied to certain populations?

So the initially-entranced folks, self included, backed off our initial enthusiasm, and the furor seems to have mostly died down. I, however, am still stuck with a pedagogical problem: as an instructor in introductory information security, I actually do have to teach people to snoop on, and even attack the privacy and security of, other people and systems. I know that sounds horrifying. I know it does! And it definitely gets into some pretty dark-gray gray areas. But stick with me just a bit longer while I explain.

Over a longish period of information-security work, it’s become clear that the only way to have any confidence at all that a system (in the large sense, so not just “a technological system” but “a technosocial system, emphatically including the people involved or even enmeshed in it”) is secure or private (not, of course, the same thing) is to test it by attacking it:

  • To test whether deidentification suffices to anonymize a dataset (spoiler: it rarely if ever does), researchers try to reidentify one or more people in it, often using additional available data to test reidentification via dataset overlap. See, for example, the Narayanan and Shmatikov paper that doomed the Netflix recommender-system contest. (A toy version of this linkage appears right after this list.)
  • To test the security of a given piece of software, you ultimately try to break it. Yes, there are tools (e.g. “Google dorks,” “vulnerability scanners,” “fuzzers,” even Shodan) to locate obvious or common problems, but they’re not enough. A creative, lateral-thinking human being is much better at finding exploitable holes in code than a computer.
  • To prioritize and test for holes in systems (again, “system” writ large), you first think like an adversary—what are the crown jewels in this system, and how would someone who wants them attack the system? This is called “threat modeling,” and thinking-like-an-adversary is a crucial part of it; without that, you end up with what Bruce Schneier calls “movie-plot threats” while ignoring the gaping system problems right under your nose (as, for example, Equifax certainly did). A crucial insight in threat modeling, captured concisely in this xkcd cartoon, is that your enemies always attack with the simplest method likely to work.
  • And once you have your threat model, you test how well your system resists it by, well, attacking your system in the ways you have identified it to be potentially vulnerable! This often happens in the form of “penetration testing,” which can be done on physical systems, social systems, technological systems (such as networks or software), or any combination of the three. My favorite example of a pentest that goes after all three types of system is this absolutely astounding Twitter thread, which I use in my intro course, and after which I named the class’s messing-around server “Jek.”
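
Here’s the dataset-overlap idea from the first bullet above, boiled down to a toy sketch. Every record is invented; the “attack” is nothing more than a join on quasi-identifiers, which is rather the point.

```python
# Toy illustration of reidentification by dataset overlap (the
# Narayanan/Shmatikov idea in miniature). All records are invented.

deidentified = [
    # "anonymous" because there's no name attached
    {"zip": "53703", "birth_year": 1971, "gender": "F", "diagnosis": "asthma"},
    {"zip": "53703", "birth_year": 1990, "gender": "M", "diagnosis": "diabetes"},
]

public = [
    # e.g. a voter roll or social-media profile with the same quasi-identifiers
    {"name": "Pat Example", "zip": "53703", "birth_year": 1971, "gender": "F"},
]

def quasi_id(record):
    return (record["zip"], record["birth_year"], record["gender"])

lookup = {quasi_id(r): r["name"] for r in public}
for row in deidentified:
    name = lookup.get(quasi_id(row))
    if name:
        print(f"{name} -> {row['diagnosis']}")   # reidentified
# Pat Example -> asthma
```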

So I can’t get around it. If I’m to prepare students to take information privacy and security seriously, never mind enter actual infosec and privacy careers, I have to show them how to think like a Garbage Human (which is how I often phrase it in class), and I have to show them how to attack systems (writ large). How do I do this without turning them into Garbage Humans themselves?

This isn’t exactly a new problem in infosec, of course; the learn-to-defend-by-attacking paradox is the earth out of which Certified Ethical Hacker, CIP{M|T|P}, and similar tech-plus-thinking-about-law-and-ethics certifications grew. It’s not even a new problem generally—if we were to strip academe of everything that could be used to Garbage Human, how much of academe would be left? (Yes, yes, plenty of computer scientists claim that computer science would be left. Those computer scientists are wrong, wrong, wrong, wrong, wrong about that.)

What I ended up doing, because I felt more than a little bad about accepting the law-prof’s assignment idea so uncritically, was going back through my syllabus, assignments, and class slides looking for how I’d approached gray areas and put guardrails around students’ growing potential for Garbage Humanning. What I found fell into an actually rather small number of techniques:

  • Clearly and often laying out stuff that’s either illegal or so Garbage Humanny that it should be. For example, I use altering physical mail envelopes as an analogy to various address-spoofing attacks… but I also explicitly point out that mail tampering is amazingly illegal in the US and they shouldn’t do it. In person in the classroom, I am not at all shy about labeling certain practices Garbage Human territory.
  • Giving copious examples of how real people and organizations have been harmed by attack techniques. I can’t control whether my students use what I teach them to Garbage Human. I can control whether they can reasonably use the excuse “I didn’t know this could hurt anybody!” and I definitely try to.
  • When students in my class perform actual reconnaissance, attack, or forensics maneuvers, they’re doing it on me, on themselves (a good habit to get into! and certainly how I prep any assignment where they’ll be looking at me or my data), or on canned datasets created for the purpose (yes, I use the Greg Schardt/Mr. Evil dataset, for lack of one that’s more recent). They’re not doing it on unwitting and possibly-extra-vulnerable targets. Again, the techniques they’re learning absolutely can be repurposed for Garbage Humanning—but I’m clear that I don’t want them doing that, and I don’t give them any actual practice kicking down.
  • Keeping the emphasis on “attack to defend” throughout. They’re not learning adversarial thinking and attack techniques to turn into Garbage Humans, but to equip themselves to defend themselves, their loved ones, and those for whom they are in some way responsible against the depredations of Garbage Humans.
  • Being open about my own dilemmas vis-à-vis Garbage Humanning. For example, I am unbelievably tempted to pull a Narayanan-and-Shmatikov on the Minnesota learning-analytics dataset, the one from several Soria, Nackerud, et al. publications. Even though I don’t actually have that dataset (and don’t want it, good gravy, what a terrifying responsibility), I’d bet Large Sums of Money that knowing the cohort entry year (which, yes, they published) is enough all by itself to find some folks in the dataset via LinkedIn or a date-bracketed Google dork against the University of Minnesota’s website, and I might even be able to find some folks in their painfully-low-n outlier groups. Possible? Unequivocally. Absolutely without question possible. I’m not even good at reidentification and reconnaissance techniques and I am absolutely sure that I can do this. Ethical? … Well, that’s a tough one, which is why I haven’t actually done it.

Is this enough? I don’t know. I’m certainly still kicking the problem around in the back of my head, because if I can do better than I’m doing, I want to.

Fever

I started this a while ago, but didn’t have anywhere to share it. California’s tussle with Elsevier gave me reason to update it, so here you go. By all means hear it in your favorite singer’s voice. I favor Rita Moreno, myself.

Never know how much I’d read you
Never know how much I’d click
When you straight-up overcharge me
I get a fever makes my budget sick

(chorus below)
You give me fever when you dis me
Fever when you NDA
Fever!… when you Big Pigs
Make me sign the library away

Journal Big Deals in science
Ebooks the humanities
We do their work when they call our name
They still make us pay APCs

(chorus)

Germany has got the fever
Got it from V-S-N-U
Fever isn’t such a rare thing
Sweden’s gone and got it too

(instrumental interlude, for a change)

All those California budgets
Shattered by the NPG
Elsevier tried double-dipping
But they couldn’t get past old UC

(chorus)

Now you’ve listened to my story
Here’s the point of my tirade
Pigs exist to give you fever
To make you broke and lonely and afraid!

They give you fever when they dis you
Fever if you live and learn
Fever when they piss you
Off—an awful way to burn
What an awful way to burn
What an awful way to burn
What an awful way to burn…

Right now bits and pieces of a “Data Breach Tango” are rattling around in the back of my head. We had it comin’! We had it comin’! We had it comin’ all the time…

Libraries no longer guard patron privacy.

It’s National Library Week, and the American Library Association is marking the occasion with a triumphalist paean to libraries. I don’t mind that in and of itself; triumphalism on libraries’ behalf is part of ALA’s job. But so is truth. And there’s a chunk of this paean that’s not true—not just not true, but importantly false.

Stephen Francoeur on Twitter noticed a contradiction between two squares in the self-congratulatory table on the bottom of page 7 (image used by permission):

Six-cell table bragging on libraries, with the middle bottom cell altered to point out a contradiction in two top-row cells

Francoeur is not wrong, of course—libraries can’t both provide a refuge from surveillance and routinely surveil students to (purportedly) establish links between library use and student success. Shame on ALA for not so much as noticing the contradiction there.

My irritation goes deeper, however, and centers on that top right-hand block, which says: “Librarians have long championed their community members’ right to access information privately, and serve as an essential refuge where everyone can check out materials or browse the internet without their information being shared.”

I’m sorry, but the second half of that sentence is flatly untrue. It is a whopper so gargantuan that if whoppers still enlarged noses, the noses of the people who wrote it would span the ocean and we could hang a transoceanic Internet cable off them. American libraries and librarians cannot truthfully claim this any longer, if indeed they ever could. Let me count just a few ways in which libraries do not prevent—and sometimes even foster—information/data sharing about library-information use and Internet browsing.

Abysmal web-privacy hygiene

Sniffing network traffic is so simple I teach my Intro to Information Security students to do it—and this is a class with zero technical prerequisites. If you’d like to try it yourself—do not do this on a public network, though; that’s snooping on people, which is Not Okay—download Wireshark, fire it up, then:

  1. go to the Capture menu and select Options
  2. make sure the “Enable promiscuous mode” box is checked, then OK out of the options menu
  3. type tcp port http into the capture-filter box at bottom (or select it from the handy dropdown menu), and
  4. double-click (most likely) “en0” for the local wireless network.

Congratulations; you’re sniffing unencrypted web traffic. Hit the red-square button when you’d like to stop doing so. If you actually look at what you get when you do this, you’ll notice that it’s the entirety of whatever pages are being browsed, by whoever is on the network. (Unless someone is using a VPN, granted, but if they are, Wireshark shouldn’t be picking up details of their traffic at all.)
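
If you’d rather script it than click through Wireshark, here’s roughly the same exercise as a Python sketch using the third-party scapy library. Treat it as a sketch: it assumes scapy is installed and that you’re running with capture privileges, and the same only-on-a-network-you-control caveat applies.

```python
# Rough Python equivalent of the Wireshark exercise above, using scapy
# (a third-party packet library). Only run this on a network you control.
from scapy.all import TCP, Raw, sniff

def show(pkt):
    # Print the first line of any unencrypted HTTP payload that floats past.
    if pkt.haslayer(TCP) and pkt.haslayer(Raw):
        first_line = bytes(pkt[Raw].load).split(b"\r\n", 1)[0]
        print(first_line.decode("latin-1", errors="replace"))

# "tcp port 80" is the same capture filter as "tcp port http" above.
sniff(filter="tcp port 80", prn=show, store=False)  # Ctrl-C to stop
```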

The corollary is inescapable: libraries that have not moved their websites to HTTPS—encrypted them—are allowing random Wiresharkers anywhere to snoop on anyone using their websites, not to mention nosy ISPs, ad-mad airlines, and any number of other creeps. Essential refuge? Not even close.

Academic libraries, for the most part, moved to HTTPS fairly quickly once Let’s Encrypt made it less costly and Google started penalizing insecure pages in search results. American public libraries are horrifyingly behindhand, however. At a presentation last October for the Minnesota Library Association, I demonstrated that of the dozen public-library consortia in Minnesota, only half had fully-encrypted websites (with one more working on it). When I redid that work for what was to be an encore webinar in Wisconsin, I found that of our sixteen public-library systems, only six have fully-encrypted websites (with one more working on it). I started doing this work for other states too, but I am as usual overcommitted, so I can’t finish it. If you can, please do! The crowdsourcing spreadsheet I was building is here.

Neither public nor academic libraries make much if any effort to prevent third-party tracking of Internet use from library computers, never mind library-provided wifi. I don’t know of any libraries that routinely have a tracker-blocking plugin installed in library-computer web browsers. (If your library does, do tell me! I will be delighted!) Dealing with wifi is admittedly harder, but the Pi-Hole project demonstrates that it is at least possible. The benefits would not be limited to improved patron privacy; they would also include vastly lessened bandwidth use and a faster, more pleasant user experience for people browsing on mobile devices.

I don’t even know of any libraries that set the library-computer default browser search away from Google to DuckDuckGo. This should be an absolute no-brainer! Google is so evil they’re lobbying against anti-audio-surveillance laws!

Incidentally, if you’re wondering how I can possibly be so blunt about this, let me explain. My Minnesota talk went over like a lead balloon. I rather expected that, having tilted at windmills before, but I was honestly surprised that the issue seems to be that I made specific critiques of specific libraries and library systems. Not cricket, apparently, even when the critiques themselves are fully legitimate. Are we that fragile in librarianship? That unwilling to be confronted with our own weaknesses? Well, that’s not good.

When the Wisconsin Library Service asked me to reprise the talk as a webinar as a favor to them, I agreed. I was even more surprised to receive an email from them a week before the webinar date saying that a Minnesota librarian had told them about my earlier talk fiasco, and asking that I remove from the webinar any and all references to specific Wisconsin libraries and library systems. Since I don’t generally hold with censorship, I refused and canceled the webinar.

So I have most of the Upper Midwest disgusted with me already. It can’t get much worse, which oddly enough confers a certain immunity. I might as well try to use that for good.

Anyway, if blocking trackers is an all-around privacy and UX win, why on earth aren’t libraries doing it?

Trackers, trackers everywhere

Because libraries use web trackers, of course. Usability/UX trackers with horrendous privacy-demolishing terms-of-service agreements, like NewRelic or Ad Nexus. Actual marketing trackers—I will not even consider joining the Wisconsin Library Association until Multiview trackers disappear from its website and WLA has the grace to apologize to all of us, members and non- alike, for Multiview ever being there. Google Analytics, ubiquitous in libraries as everywhere else. Even Doubleclick—which is one of the Foul Horsemen of the Adpocalypse—when libraries aren’t careful about how they embed YouTube videos into websites. (Also, YouTube? Even leaving tracking questions aside, YouTube is the actual worst.)

Libraries. Routinely track. Their website visitors’. Website usage. Essential refuge? Try “willing collaborator in surveillance capitalism,” in the Vidkun Quisling sense of the word “collaborator.”

The e-resource vendors from whom libraries purchase access also routinely track patron information use, and are trying to do so more commonly and more identifiably. For a quick explainer around this in the academic-library context, try Cody Hanson’s CNI talk slides from earlier this week. Public libraries are no better; EBSCO, for example, is a wilderness of trackers. What do libraries do about this? Presently nothing whatever, save for the occasional windy grandiose manifesto with no accompanying procedures or implementation plans—empty talk, in other words. Libraries do not forbid vendor tracking of patrons in their content licenses. Libraries do not prevent vendor tracking via their systems.

I mentioned a tracking hygiene issue specific to academic libraries in a prior post: proxy-server logs, which inevitably track and retain identified records of patron information use, are not routinely deleted. Essential refuge my size-11 foot.

Surveillance in assessment practices

But web tracking is just a special case of a more general practice: assessment by surveillance. From the hideous edtech surveillance monsters attacking K-12 education to “customer relationship management” surveillance in public libraries to library learning analytics in higher education—it’s all surveillance, and it all includes surveillance of patron information use.

I’m tired and this rant is long enough already, so I’ll save a longer dissection of assessment surveillance for another day. Suffice to say that just in the realm of library learning analytics, the surveillance rises to toweringly creepy and horrifying levels without even the slightest nod toward consent practices, ethical review is inadequate where it exists at all, and data hygiene and deidentification practices are frankly obscene. (I will at some point emit another rant about deidentification vs. anonymization and why the latter is an utter impossibility—but I’m tired and this rant is long enough already.)

Essential refuge? No refuge at all.

Summation

American libraries no longer guard patron privacy. We can do so again if we choose, but it will take quite a bit of work, not to mention considerably more self-restraint faced with the vastly-increased ease of digital surveillance than we have yet shown.

Until then, we need to stop lying about it. Lying is a Facebook trick, as I described at length in a talk I gave yesterday for the local IT Leadership Conference (and promised to post here, so now I have). We should be above lying, we who prize truth.

Unizin Not-Common-Knowledge Data Model

I’m doing a talk this Tuesday for a campus IT conference. Should be a good time, for certain values of that phrase. I’ll post a link to the slides here afterwards.

While writing the talk—I’m one of those dorks who does script out talks word-for-word, though I do it in the slangy, choppy rhetorical style I actually talk in; academese is not my speech register and I don’t pretend it is—I ran across the Unizin Common Data Model, which if I understand correctly underlies the giant data warehouse for student data called the Unizin Data Platform. This will hold data from all Unizin member institutions.

To Unizin’s credit, they have a data dictionary publicly available, though every time I’ve tried to get just the table listing (or ERD?) it hasn’t worked. Still, the list of column/attribute names is there, and this list is a swift and daunting education in student Big Data.

See for yourself by all means, but here are some specific areas of the table I suggest looking at:

  • Everything from the Incident and IncidentPerson tables (conveniently, the table name is the first column in the data dictionary and is how the dictionary is ordered), especially the RelatedToDisabilityManifestationInd column
  • the LearnerAction and LearnerActivity tables, noting for the record that hashing the LearnerID is not anything like a sufficient privacy guarantee (a short sketch of why follows this list)
  • the Person table and related tables, which are detailed to an extent that gives me nightmares
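
About that hashed LearnerID: student identifiers are drawn from a small, guessable space, so anyone holding the hash can simply hash every candidate until one matches. A toy sketch with an invented seven-digit ID:

```python
import hashlib

def h(student_id: str) -> str:
    return hashlib.sha256(student_id.encode()).hexdigest()

# A "deidentified" warehouse record: we only ever see the hash of the ID.
hashed_learner_id = h("9031245")

# Seven-digit IDs mean at most ten million candidates. On a laptop this loop
# finishes in seconds, which is why hashing an ID is not anonymization.
for candidate in range(10_000_000):
    sid = f"{candidate:07d}"
    if h(sid) == hashed_learner_id:
        print("recovered LearnerID:", sid)
        break
```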

Have fun asking yourself why on earth a learning-management system needs to know all this… and considering the Equifax-level horror if there is ever a breach in it.

Kanopy and Elsevier: united in password mishandling?

My introductory information-security course contains both undergraduates and iSchool graduate students. Every once in a while I get to drop in a library- or archives-specific tidbit, and today (the first class meeting after Spring Break), I had two such tidbits amid all the other news:

Shortly after the Kanopy breach broke, Jessamyn West passed on a very important question from Dan Turkel to Kanopy on Twitter: “Are you [Kanopy] storing user passwords in plaintext?”

Let’s back up and examine that question a moment, shall we?

“Plaintext” is information-security jargon for “not encrypted.” “Encrypted,” for our purposes, means “changed such that the original data cannot easily (or ideally at all) be figured out.” So, when Elsevier actually broadcast passwords in plaintext to all and sundry via some web dashboard, it disobeyed one of the fundamental best practices in infosec. If Kanopy was storing its passwords in plaintext, that’s just as bad.

(How do you know if a user’s password is correct, if you can’t store it figure-outably? You store a changed version instead, one that can’t feasibly be changed back, and you know exactly how you changed it. When the user enters their password, you just change it the same way you originally changed the stored password, at which point you can compare the results.)
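
Here’s a minimal sketch of that compare-the-changed-versions idea, using only Python’s standard library. The specific function and iteration count are illustrative; a real system should follow current guidance or use a maintained password-hashing library.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Store (salt, derived_key); the password itself is never stored."""
    salt = salt or os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, key

def check_password(password, salt, stored_key):
    """Change the attempt the same way and compare the results."""
    _, attempt_key = hash_password(password, salt)
    return hmac.compare_digest(attempt_key, stored_key)

salt, key = hash_password("correct horse battery staple")
print(check_password("correct horse battery staple", salt, key))  # True
print(check_password("hunter2", salt, key))                       # False
```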

Nobody is supposed to store passwords in plaintext! Ever! Much less broadcast them in plaintext to all and sundry on a web dashboard! (What you are supposed to do with them is… complicated, and keeps changing as password-cracking software and hardware improve. Check with your favorite infosec expert, okay? And consider multi-factor authentication.) So what Turkel was asking Kanopy boils down to “okay, you were caught being careless; exactly how careless were you?”

Kanopy never answered, at least not on Twitter. This… does not exactly inspire confidence. Nor has Elsevier’s post-incident public relations on Twitter, which as best I can tell has substantially amounted to “it wasn’t that bad!” “everybody else has breaches too!” and similar sad, disingenuous deflections of responsibility. There are best practices in handling security incidents—perhaps unsurprisingly, infosec refers to them by the term “incident response.” These are not them.

I hope to have more to say about incident response in time, because it’s a thing more libraries will find themselves stuck doing—including when our vendors should but don’t—and the first step is always “have a plan for it.”

Authentication and the privacy of e-resource use

I haven’t been a library sysadmin or an e-resource librarian—missed my chance right out of library school, though I don’t precisely regret it as the library in question was somewhat troubled at the time (it has recovered admirably since). So if I goof up anything here, which I am likely to do, correct me on Mastodon. Still, I am definitely seeing a need for a How E-Resource Authentication Works explainer aimed at people who, like me, don’t worry about authentication for a living.

So, here’s the basic problem. Dr. Scientist wants to get access to some e-resources behind a paywall. The paywall needs to know if Dr. Scientist is allowed to do that. Dr. Scientist’s reading habits are properly none of anyone’s business, as long as the paywall has gotten its money.

Dr Scientist saying "Let me through!" to the paywall, which answers "Why should I?"

One way this can work is if Dr. Scientist establishes an account directly with the paywall, paying whatever the paywall demands—and giving the paywall plenty of personal information about her at signup, no doubt. Each time Dr. Scientist logs in with the paywall, then, the paywall can observe her search and download behavior—even her actual reading behavior if it presents articles in web-buggable HTML rather than PDF or whatever that horrible “enhanced PDF” garbage is that some paywalls are trying to shove on us. The paywall can also load its HTML pages with all the third-party trackers it wants. Dr. Scientist’s privacy from that paywall? Essentially nil.

It’s not coincidence that a number of paywalls have been pushing this model at developed-world Dr. Scientists pretty hard of late. “Bypass your tedious old library—get direct access!” Running roughshod over reader privacy represents a brand-new revenue stream for paywalls—and paywalls, unlike academic libraries, have made no public commitments not to.

The reason it isn’t typically a revenue stream already (pace the presence of web trackers on publisher websites) is that the vastly more common access model has the library in the middle. The library pays the paywall, and the library makes sure that its patrons can get through the paywall to the stuff they want.

Dr. Scientist tells the library "Make the paywall let me through! You know me; I'm Dr. Scientist."

In this scenario, the communication between Dr. Scientist, library, and paywall can take place in a surprising number of ways. Here are a few.

IP address-range recognition

If Dr. Scientist is on campus, her computer will have an IP address assigned by campus IT. The library can send a list of campus’s IP ranges (that is, clusters of IP addresses) to the paywall. When the paywall sees Dr. Scientist’s request, it notices that the IP address belongs to campus and allows Dr. Scientist through without further authentication. The paywall doesn’t know who Dr. Scientist is at this point without doing extra work! It just knows that she comes from a campus that has paid the paywall.
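
In code terms, the paywall’s side of this is about as simple as it sounds. A sketch using Python’s standard library; the ranges below are documentation-reserved example addresses, not any real campus’s.

```python
import ipaddress

# Hypothetical ranges the library has registered with the paywall.
CAMPUS_RANGES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def on_campus(request_ip: str) -> bool:
    """True if the request comes from a registered campus range."""
    addr = ipaddress.ip_address(request_ip)
    return any(addr in net for net in CAMPUS_RANGES)

print(on_campus("198.51.100.42"))  # office Ethernet: let her through
print(on_campus("192.0.2.7"))      # coffeeshop wifi: no dice
```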

Dr. Scientist’s privacy from the paywall under this model: fairly good, though not inviolate. The paywall can track and record what Dr. Scientist does on it (her “session”), and if the session is extensive enough or Dr. Scientist’s interests unusual enough, it might be able to figure out who she is that way. If Dr. Scientist’s office computer has a static (unchanging) IP address—as my work laptop does when it’s connected to my office’s Ethernet cable—her sessions there can be compiled into a fuller dossier on her. If the paywall is connected up with third-party trackers and data brokers, the chances of reidentifying nominally-“anonymous” Dr. Scientist skyrocket.

But Dr. Scientist likes to do her current-awareness reading in the local coffeeshop, whose wireless is not part of any campus IP range. So IP address-range recognition won’t always work to get her past paywalls.

Proxy servers

To fix this problem, academic libraries typically run what’s called a “proxy server,” a bit of software that asks Dr. Scientist to authenticate herself as belonging to campus, and then trots over to the paywall to get Dr. Scientist what she wants to read. Since the proxy server’s IP address is campus-based, the paywall lets it through (although there are other ways this can work, ways I find too abstruse to explain here).

Dr. Scientist’s privacy here is tolerable—again, the paywall does not know who she is, and her requests will be mingled with all the proxy server’s other requests on behalf of the rest of campus, which is even better for her privacy—but there is a notable weak spot in it: server logs. Proxy servers keep logs! These logs absolutely do connect Dr. Scientist with exactly what she read.

Why do libraries let proxy servers keep logs, if librarians are so fired up about privacy? Excellent question. The answer explains why I personified the library as a besworded guard instead of, say, an usher or a maitre d’. If Dr. Scientist suddenly decides to do a text mining project, starting with an automated download of the entirety of the paywall’s contents from her favorite coffeeshop, the paywall will notice and will lean on the library. It’s then on the library to locate Dr. Scientist and tell her to knock it off, ideally before the paywall slams shut over all of campus. Without the proxy-server logs, the library can’t do that.

(Dear Dr. Scientists and Dr. Digital Humanists everywhere: Don’t. Effing. Start. Text-mining projects. On the literature. Without talking to. THE LIBRARY. First. I have heard the above story so many times from so many librarians at so many campuses about so many Dr. Whoevers in so many different disciplines. Knock it off, you overcredentialed yutzes. Librarians have better things to do than ride herd on you.)

How long are proxy-server logs retained? At entirely too many libraries, entirely too damned long. I have a reliable report of one library, when this question was asked, answering “Oh, they don’t take up much space; we keep them indefinitely.” This is appallingly poor privacy practice and that library should be ashamed of itself. Dump logs routinely, ideally via cron job or similar automation. Personally, I wouldn’t keep proxy-server logs longer than a day unless there was a reason to, but I suppose I could be talked into a week or so. (Any paywall that takes a whole week to lean on the library about a downloading problem deserves what it gets.)
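
For the “cron job or similar automation” part, here’s a sketch of the automation in Python. The log path and retention window are placeholders to adjust to local reality; schedule it daily with cron, a systemd timer, or whatever the local equivalent is.

```python
#!/usr/bin/env python3
"""Delete proxy-server logs older than a short retention window.
The path and retention period below are placeholders, not recommendations
for any particular proxy product."""
import time
from pathlib import Path

LOG_DIR = Path("/var/log/proxy")   # hypothetical log location
MAX_AGE_DAYS = 7

cutoff = time.time() - MAX_AGE_DAYS * 86_400
for logfile in LOG_DIR.glob("*.log*"):
    if logfile.stat().st_mtime < cutoff:
        logfile.unlink()
        print("deleted", logfile)
```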

By the way, any library using proxy-server logs in assessment or learning-analytics projects without deidentification and a whole lot of fuzzing should have its librarians sentenced to walking the halls at a major national library conference wearing sandwich boards proclaiming “WE TRASHED OUR PATRONS’ INTELLECTUAL FREEDOM.” If the proxy-server logs were not only not deidentified, but actively correlated with other sources of campus data about individuals, those sandwich boards should be lined with burlap and weighted with many pounds of lead. Not even slightly okay.

Hm, this post is long enough, I think, and I need to do a little more research before I continue, so I’ll tie it off here for now. Next up: single-sign-on and RA21!

(All clip art from Open Clip Art. I am very, very bad at art.)

Welcome to Tattle Tape!

It’s not pretty (I have a lot of CSS work to do), and it’s not really ready for prime time, but here we go anyway. Hi, I’m Dorothea Salo, and welcome to Tattle Tape.

Tattle Tape is a blog that will (mostly) be about patron and staff privacy in library contexts. (I’m a noisy person with a lot of professional interests. I know myself pretty well. I will blog about other things now and then. Might as well admit that to myself and everyone else.) It’s named after magnetic strips placed in library books that set off security gates, which tattle on possible thieves of library materials.

A few things are worth setting out at the, as it were, outset, by way of expectations management:

  • Tattle Tape is my blog. I’m not blogging at anybody else’s behest. What you read here is on me and nobody else—especially not any employer I have ever had or any project with which I have ever been associated.
  • Comments are off on most posts. I may occasionally open them. I can be found in the fediverse (Mastodon etc.) at @dsalo@digipres.club for those who are so inclined.
  • I am not nice. I have never been nice. I never will be nice. Nice does not appear among my goals for Tattle Tape. In particular, I have every intention of naming names with respect to publicly-perceptible privacy issues in libraries. If you published it, disseminated it publicly online (as with presentation slides, whitepapers, or whathaveyou), or I can find it out from your website by something as simple as clicking on UBlock Origin or Privacy Badger, it is fair game for Tattle Tape. (That said, I’m not a pentester or webapp bughunter, and if I were, I certainly wouldn’t do it for free. Your secrets are safe from me.)
  • Speaking of which: Part of the CSS work I need to do is getting rid of Google Fonts use in this blog theme. (I will do my best to get that taken care of this week.) Beyond that, other than ordinary Apache logging which I pay zero attention to and delete regularly, there is no logging or other tracking on Tattle Tape (nor the site it is part of), nor will there ever be unless someone hacks the site. I’m a librarian. My library school learned me real good about privacy. I don’t sell, much less donate, my readers to Big Data.

I can’t promise much by way of posting schedule. I’ll spare you the boring litany of busy, but yes, I’m overcommitted and barely staying afloat just now, and that doesn’t look like easing up real soon.

I know some useful things, though, and when I can make time, I’ll share them. Welcome to Tattle Tape.