Some tidbits about data handling in library learning analytics

So I finished building and coding up my dataset of library learning-analytics articles! That was a lot of work. I also have a data dictionary, and a methods section in the article draft! Yesterday I got to start writing queries against the database and writing them up in the article draft.

Want some tidbits about the 62 research projects I ended up studying (46 of them American)? Of course you do. Have some:

  • 35 of 62 projects, 27 of those 35 American, made no attempt whatever to deidentify data before analysis. Ahoy ahoy potential data leaks and insider threat!
  • 11 projects, 8 American, used data that revealed the subject of a patron’s inquiry. That’s a pretty bright-line no-no in libraries, folks.
  • Only 11 projects notified students about the specific research that would be taking place using their data. One more claimed that students were notified because the campus ID card terms of service told them research (unspecified) would be happening.
  • Actual informed consent? Five. Five projects sought it. Out of 62.
  • Wondering where ethics review was in all this? Yeah, me too. Of the 46 American projects, eleven passed IRB review, four were declared exempt, and a big fat nothing for the remainder. One of the 16 non-American projects received ethics review.
  • Sensitive data used in these projects included: socioeconomic status data or proxies thereof (13 projects), high-school performance data (GPAs and SAT/ACT-or-analogue scores, 13 projects), location data (7 projects), first-generation student status (6 projects), national origin or citizenship data (4 projects), military/veteran status (3 projects), and disability status (1 project).

I’ve got more; I wrote plenty of ANDed WHERE clauses yesterday (SQL is so much fun!), and more may occur to me as I continue the writing-up. But the above certainly gives you the flavor. It is not a good flavor.
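For the curious, the tidbit-pulling queries look roughly like this. This is a sketch against a made-up, radically simplified schema — the real table and column names live in the data dictionary, not here:

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE projects (
        id INTEGER PRIMARY KEY,
        country TEXT,
        deidentified INTEGER,      -- 0 = no deidentification attempt before analysis
        informed_consent INTEGER   -- 0 = consent never sought
    )
""")
conn.executemany(
    "INSERT INTO projects (country, deidentified, informed_consent) VALUES (?, ?, ?)",
    [("US", 0, 0), ("US", 1, 1), ("CA", 0, 0)],  # toy rows, not the real data
)

# An ANDed WHERE clause: American projects that made no
# attempt to deidentify data before analysis.
count = conn.execute(
    "SELECT COUNT(*) FROM projects WHERE country = 'US' AND deidentified = 0"
).fetchone()[0]
print(count)  # 1 with the toy rows above
```

Stack up a few more ANDed conditions and you get exactly the kind of cross-tabulated tidbits listed above.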

Please miss me with all the gaping loopholes in which projects must receive ethics review. I know. That’s part of the problem! I plan to write about it at length in the other paper! (I also want to acknowledge David Fiander for giving me lots of useful intel on Canadian ethics-review loopholes yesterday on Mastodon. Appreciate it, David, and I’ll also acknowledge your help in one or both papers.) It may seem convenient to dodge all this red tape, but in my head what it really means is that LIS is letting its researchers show their ethics underwear all over the place, unguided and (crucially) unprotected. It’s not the Value Agenda for Libraries pushers whose careers will be tarnished when (and it’ll be when if I have anything to say about it) retractions and expressions of concern start happening; I expect they’ll claim it’s on researchers to Do Ethics Right, none of their concern. It’s pretty much academic librarians doing what VAL pushers told them was okay—not just okay, vitally important—who will be hung out to dry.

Not sure how the VAL pushers sleep at night, honestly—if my analysis holds water, which I think it does or I wouldn’t still be working on it, they’ve royally screwed students and librarians—but I suppose that’s not my problem.

Anyway, a lot of the discussion for this piece will be the first (as far as I know) attempt at examining real-world library learning-analytics practices in light of what we know from Data Doubles and similar research (which there’s rather more of now! yay!) about student preferences, the top two of which have repeatedly been shown to be notification and the chance to consent (or not). There’s an ethics-of-care argument there that I’m happy to make: if we care about students as much as we claim to, ignoring or overriding their stated preferences, especially for a research agenda that does not directly benefit them (hello benevolence! the Value Agenda for Libraries has none of you!), cannot be ethically acceptable.

I’ll publish the data, too. Zotero exports for both eligible and ineligible project citations, SQLite database, CSV database exports (though I need to think about building useful views for later-researcher convenience), basically the lot. CC0 on all of it, not that there’s much if any copyright in it to claim. You want to play in my data playground? Go for it.
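On the "useful views" point: in SQLite that can be as lightweight as shipping a few CREATE VIEW statements alongside the database, so later researchers don't have to rediscover the common filters themselves. Another sketch with made-up names:

```python
import sqlite3

# Stands in for the published .sqlite file; schema and names are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE projects (
        id INTEGER PRIMARY KEY,
        country TEXT,
        informed_consent INTEGER
    )
""")
conn.executemany(
    "INSERT INTO projects (country, informed_consent) VALUES (?, ?)",
    [("US", 0), ("US", 1), ("UK", 0)],
)

# A convenience view encoding one of the common filters,
# queryable like any table.
conn.execute("""
    CREATE VIEW consentless_projects AS
    SELECT * FROM projects WHERE informed_consent = 0
""")
rows = conn.execute("SELECT country FROM consentless_projects").fetchall()
print(rows)  # [('US',), ('UK',)]
```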

DRAFT: A privacy agenda for IMLS’s strategic plan

The Institute of Museum and Library Services is an incredibly influential funder of US library praxis and research. It is embarking on a strategic-planning process on which it has invited input (deadline August 6).

Disclaimer time: Work I’ve participated in has been directly funded by IMLS twice, once for PROUD and PRAVDA, once for Data Doubles.

A draft of the response email I intend to send appears below. Feel free to kibitz on Twitter; I want to let it stew in my own head for a few days before I send it. Feel free also to adapt what I’ve written here for your own comments to IMLS.


This email represents me as an individual, not the UW-Madison iSchool, the UW-Madison School of Computer, Data, and Information Sciences, or the IMLS-funded Data Doubles project. I appreciate the opportunity to make strategic suggestions for IMLS, whose programs are absolutely vital to American GLAM institutions.

First, I would be remiss not to thank IMLS and its staff for its past and present efforts to advance privacy in American libraries. The Data Doubles project is naturally uppermost in my mind as I’m a co-investigator on it, but I am also very impressed with the recent Safe Data Safe Families effort led by the University of Maryland. Thank you very much, IMLS, for your attention and assistance.

I believe IMLS should make privacy (and relatedly, information security) standards, education, praxis, and research a cornerstone of its next strategic plan. Such an agenda should accord well with current federal legislative and regulatory priorities around information security and data privacy. Outlining a few of the opportunities I see:

  • Broadening participation: Education and training on privacy, infosec, and personal/behavioral data literacy is a desperate need across practically all of American society, from youth on up and across all identities. How can American citizens appropriately influence the ongoing development of state and federal privacy law until a trusted institution like the American public library explains what is at stake? Data Doubles research revealed critically low privacy and security awareness among its respondents; how can college students push back against their schools or libraries harming them through their data if they do not understand their school’s or library’s data practices and how those practices can cause harm?
  • Essential skills: Professional development around privacy and infosec among GLAM workers is also desperately necessary. Libraries and archives have far fewer security and privacy-aware workers—even on the fairly basic awareness level I lead my own learners to—than they need. For example, the K-12 sector, hard-hit by ransomware, needs as much infosec help as it can get; could trained school media specialists be a source of that help? And how can academic librarians evaluate the impact of (just as one example) single-sign-on systems on patron privacy without the sociotechnical background to understand how these systems work?
  • Outreach and partnerships: An eye-opening result of the Data Doubles investigations is that respondents had a seriously inaccurate sense (when they had any sense at all) of library privacy commitments; moreover, in survey results they trusted librarians with their data rather less than they did other campus actors. This hints at a serious communication failure: librarians have not made our privacy stance sufficiently clear to the American public. Not only does this unnecessarily inhibit patron trust in libraries, it discourages individual libraries from making and sticking to strong privacy commitments (as my recent adventure in retrieving my own circulation data [1] demonstrates). I would love to see IMLS fund additional privacy-centric library outreach.
  • Practices and tools in collection management: The ability of software, service, and content vendors to surveil patron information use has vastly outstripped the GLAM sector’s ability to assess and respond to surveillance risks. Could IMLS fund rigorous security and privacy assessments by experts? How about privacy-aware, privacy-by-design standards development?
  • Research: My remarks above hint at useful research agendas, especially around privacy and security in GLAM software, systems, and services. I believe, however, that IMLS has a near-unique research-related privacy opportunity, even responsibility: building privacy standards and expectations for library human-subjects research that respect canonical library ethical commitments to privacy, and educating LIS researchers to respect such standards.

What is now the Digital Library Federation Privacy and Ethics in Technology group outlined in 2018 how neither the Common Rule nor Institutional Review Boards nor present-day data governance structures and processes protect library patron data sufficiently to accord with library-specific privacy ethics commitments [2]. In the absence of library-specific research-ethics guidance (much less incentives) from any library professional organization or major grant funder, privacy-invasive research practices have crept largely unchallenged into the LIS literature [3], often accompanied by poor statistical analysis methods [4] and poor data-management practice [5]. Such practices include noticeless and/or consentless information-use surveillance; dragnet surveillance of the information behavior of large groups of patrons; offering patrons no way to opt out of research into their information behavior; patron-data sharing (including identified patron data) beyond library walls; and patron-data mashups with non-library data sources.

IMLS can put a major dent in privacy-invasive research practices by establishing and publishing patron-data privacy and security standards and expectations for the LIS and GLAM researchers IMLS funds. The ethical responsibilities and operationalizations thereof laid out in the original Belmont Report [6] conveniently organize and contextualize what some of those expectations might look like:

  • Respect for persons: no research involving identified or reidentifiable patron data without informed consent by all research subjects; library patron data both individual and aggregated must remain within the library and cannot be mashed up with other sources of data about patrons; proper data management procedures, including but not limited to rigorous data security and deidentification, must be described in applications and adhered to by funded research projects
  • Beneficence: research must offer a clear benefit to library patrons in general and research subjects in particular, not only to the researcher(s) or the library (a test often failed by “library value” research)
  • Justice: research methods described in grant applications must acknowledge and (insofar as possible) mitigate opportunities for bias; research must not exploit unequal power relations between researcher(s) and research subjects; research must not contribute to additional surveillance harms accruing to already-oversurveilled populations, especially minoritized populations; patron data too sensitive or invasive to collect or analyze in a physical-library context must also be considered too sensitive or invasive to collect or analyze online
  • Informed consent: required for all research subjects, without exception and independently of decisions by relevant IRBs
  • Assessment of risk and benefits: grant applications must enumerate security and privacy risks to research subjects and to library patrons more generally, weighing them against benefits to those subjects from the research and not forgetting to assess harms to minoritized populations; application reviewers must assess these risks and benefits in reviews
  • Selection of subjects: dragnet surveillance methods, especially without notice or consent, should not be funded by IMLS

Obviously IMLS needs far more input than just mine to establish these standards! Fortunately, IMLS has funded privacy and security ethicists, as well as patron advocates, whom it can now tap for their expertise. I would expect standards promulgated by IMLS to have impact far beyond IMLS-funded research (a weighty consideration in its own right). Other GLAM funders might well follow in IMLS’s footsteps, and LIS editorial boards might adopt IMLS standards for their publications.

I happily repeat that I very much appreciate IMLS’s attention to privacy issues. I wish IMLS all good fortune in this strategic-planning process.

Sincerely,
Dorothea Salo

[1] University of Wisconsin Circulation and E-Resource Access Records, https://osf.io/2axkn/

[2] Ethics in Research Use of Library Patron Data: Glossary and Explainer, https://osf.io/xfkz6/

[3] Jones et al., “A comprehensive primer to library learning analytics practices, initiatives, and privacy issues.” (C&RL, https://doi.org/10.5860/crl.81.3.570)

[4] Asher and Robertshaw, “Unethical numbers? A meta-analysis of library impact studies” (preprint: https://ir.library.oregonstate.edu/concern/articles/qn59q8754)

[5] Briney, “Data management practices in academic library learning analytics: a review” (JLSC, https://doi.org/10.7710/2162-3309.2268)

[6] Belmont Report, https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/read-the-belmont-report/index.html

Get out! GET OUT!

I liked John Legend’s portrayal of Jesus in NBC’s Jesus Christ Superstar, but there was one moment I really needed a heavy-metal scream from him and didn’t get it. (As a sometime chorister, I understand! Heavy-metal screams are tough on the throat.)

In this Big-Data-brownnosing Chronicle puff piece, the movers behind Unizin and its grossly invasive Big Data platform make plain what lots of us (self included) suspected: this is an Orwellian-telescreen Bentham-panopticon play.

“But as important or more, says Hilton, is what can be learned from Unizin’s capacity for data collection and analysis. Members have access to the online-learning platform Canvas and its associated data — not only from each member institution but also collective, anonymized data. That will increasingly enable discovery of broad trends from the overall membership, as well as micro-information for every class. Combine that with data on students’ demographics and background, Unizin says, and you have a powerful tool to make predictions, spot problems, and improve pedagogy.”

“Anonymized” data isn’t, not at the level of detail Unizin is planning. “Micro-information for every class” means labor surveillance. The default is surveillance-on, of course, and it may well be the only option. As for Brad Wheeler? I’ve seen his big ideas before. One thing I personally, in-person, my own self, noticed is that he steamrolls everyone he can—especially women. (Cis male higher-ed educators in Unizin shops, you need to speak up, and the time you need to speak up is now.)

I am heavy-metal screaming right now. My classroom should be a house of learning! But you have forced in these data thieves!

Get out. GET OUT!

Libraries no longer guard patron privacy.

It’s National Library Week, and the American Library Association is marking the occasion with a triumphalist paean to libraries. I don’t mind that in and of itself; triumphalism on libraries’ behalf is part of ALA’s job. But so is truth. And there’s a chunk of this paean that’s not true—not just not true, but importantly false.

Stephen Francoeur on Twitter noticed a contradiction between two squares in the self-congratulatory table on the bottom of page 7 (image used by permission):

[Image: Six-cell table bragging on libraries, with the middle-bottom cell altered to point out a contradiction between two top-row cells]

Francoeur is not wrong, of course—libraries can’t both provide a refuge from surveillance and routinely surveil students to (purportedly) establish links between library use and student success. Shame on ALA for not so much as noticing the contradiction there.

My irritation goes deeper, however, and centers on that top right-hand block, which says: “Librarians have long championed their community members’ right to access information privately, and serve as an essential refuge where everyone can check out materials or browse the internet without their information being shared.”

I’m sorry, but the second half of that sentence is flatly untrue. It is a whopper so gargantuan that if whoppers still enlarged noses, the noses of the people who wrote it would span the ocean and we could hang a transoceanic Internet cable off them. American libraries and librarians cannot truthfully claim this any longer, if indeed they ever could. Let me count just a few ways in which libraries do not prevent—and sometimes even foster—information/data sharing about library-information use and Internet browsing.

Abysmal web-privacy hygiene

Sniffing network traffic is so simple I teach my Intro to Information Security students to do it—and this is a class with zero technical prerequisites. If you’d like to try it yourself—do not do this on a public network, though; that’s snooping on people, which is Not Okay—download Wireshark, fire it up, then:

  1. go to the Capture menu and select Options
  2. make sure the “Enable promiscuous mode” box is checked, then OK out of the options menu
  3. type tcp port http into the capture-filter box at bottom (or select it from the handy dropdown menu), and
  4. double-click (most likely) “en0” for the local wireless network.

Congratulations; you’re sniffing unencrypted web traffic. Hit the red-square button when you’d like to stop doing so. If you actually look at what you get when you do this, you’ll notice that it’s the entirety of whatever pages are being browsed, by whoever is on the network. (Unless someone is using a VPN, granted, but if they are, Wireshark shouldn’t be picking up details of their traffic at all.)

The corollary is inescapable: libraries that have not moved their websites to HTTPS—encrypted them—are allowing random Wiresharkers anywhere to snoop on anyone using their websites, not to mention nosy ISPs, ad-mad airlines, and any number of other creeps. Essential refuge? Not even close.

Academic libraries, for the most part, moved to HTTPS fairly quickly once Let’s Encrypt made it less costly and Google started penalizing insecure pages in search results. American public libraries are horrifyingly behindhand, however. At a presentation last October for the Minnesota Library Association, I demonstrated that of the dozen public-library consortia in Minnesota, only half had fully-encrypted websites (with one more working on it). When I redid that work for what was to be an encore webinar in Wisconsin, I found that of our sixteen public-library systems, only six have fully-encrypted websites (with one more working on it). I started doing this work for other states too, but I am as usual overcommitted, so I can’t finish it. If you can, please do! The crowdsourcing spreadsheet I was building is here.

Neither public nor academic libraries make much if any effort to prevent third-party tracking of Internet use from library computers, never mind library-provided wifi. I don’t know of any libraries that routinely have a tracker-blocking plugin installed in library-computer web browsers. (If your library does, do tell me! I will be delighted!) Dealing with wifi is admittedly harder, but the Pi-Hole project demonstrates that it is at least possible. The benefits would not be limited to improved patron privacy; they would also include vastly lessened bandwidth use and a faster, more pleasant user experience for people browsing on mobile devices.
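The DNS-blocklist idea that Pi-Hole implements is conceptually simple: check each requested domain against a blocklist before resolving it, and refuse to answer for anything on the list. A toy sketch of just the matching logic (domains invented for illustration):

```python
# Toy sketch of Pi-Hole-style DNS filtering: block a listed domain
# and any subdomain of it. Blocklist entries are made up.
BLOCKLIST = {"tracker.example.com", "ads.example.net"}

def should_block(domain: str) -> bool:
    # Walk the domain's suffixes: "cdn.tracker.example.com" matches
    # the blocklist entry "tracker.example.com".
    parts = domain.lower().rstrip(".").split(".")
    return any(".".join(parts[i:]) in BLOCKLIST for i in range(len(parts)))

print(should_block("tracker.example.com"))       # True
print(should_block("cdn.tracker.example.com"))   # True
print(should_block("library.example.org"))       # False
```

A real deployment (as Pi-Hole does) sits in front of the network's DNS resolver and returns a null answer for blocked names, which is why every device on the wifi benefits without per-browser plugins.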

I don’t know of any libraries that even switch the default browser search engine on library computers from Google to DuckDuckGo. This should be an absolute no-brainer! Google is so evil they’re lobbying against anti-audio-surveillance laws!

Incidentally, if you’re wondering how I can possibly be so blunt about this, let me explain. My Minnesota talk went over like a lead balloon. I rather expected that, having tilted at windmills before, but I was honestly surprised that the issue seems to be that I made specific critiques of specific libraries and library systems. Not cricket, apparently, even when the critiques themselves are fully legitimate. Are we that fragile in librarianship? That unwilling to be confronted with our own weaknesses? Well, that’s not good.

When the Wisconsin Library Service asked me to reprise the talk as a webinar as a favor to them, I agreed. I was even more surprised to receive an email from them a week before the webinar date saying that a Minnesota librarian had told them about my earlier talk fiasco, and asking that I remove from the webinar any and all references to specific Wisconsin libraries and library systems. Since I don’t generally hold with censorship, I refused and canceled the webinar.

So I have most of the Upper Midwest disgusted with me already. It can’t get much worse, which oddly enough confers a certain immunity. I might as well try to use that for good.

Anyway, if blocking trackers is an all-around privacy and UX win, why on earth aren’t libraries doing it?

Trackers, trackers everywhere

Because libraries use web trackers, of course. Usability/UX trackers with horrendous privacy-demolishing terms-of-service agreements, like NewRelic or Ad Nexus. Actual marketing trackers—I will not even consider joining the Wisconsin Library Association until Multiview trackers disappear from its website and WLA has the grace to apologize to all of us, members and non- alike, for Multiview ever being there. Google Analytics, ubiquitous in libraries as everywhere else. Even Doubleclick—which is one of the Foul Horsemen of the Adpocalypse—when libraries aren’t careful about how they embed YouTube videos into websites. (Also, YouTube? Even leaving tracking questions aside, YouTube is the actual worst.)

Libraries. Routinely track. Their website visitors’. Website usage. Essential refuge? Try “willing collaborator in surveillance capitalism,” in the Vidkun Quisling sense of the word “collaborator.”

The e-resource vendors from whom libraries purchase access also routinely track patron information use, and are trying to do so more commonly and more identifiably. For a quick explainer around this in the academic-library context, try Cody Hanson’s CNI talk slides from earlier this week. Public libraries are no better; EBSCO, for example, is a wilderness of trackers. What do libraries do about this? Presently nothing whatever, save for the occasional windy grandiose manifesto with no accompanying procedures or implementation plans—empty talk, in other words. Libraries do not forbid vendor tracking of patrons in their content licenses. Libraries do not prevent vendor tracking via their systems.

I mentioned a tracking hygiene issue specific to academic libraries in a prior post: proxy-server logs, which inevitably track and retain identified records of patron information use, are not routinely deleted. Essential refuge my size-11 foot.

Surveillance in assessment practices

But web tracking is just a special case of a more general practice: assessment by surveillance. From the hideous edtech surveillance monsters attacking K-12 education to “customer relationship management” surveillance in public libraries to library learning analytics in higher education—it’s all surveillance, and it all includes surveillance of patron information use.

I’m tired and this rant is long enough already, so I’ll save a longer dissection of assessment surveillance for another day. Suffice it to say that just in the realm of library learning analytics, the surveillance rises to toweringly creepy and horrifying levels without even the slightest nod toward consent practices, ethical review is inadequate where it exists at all, and data hygiene and deidentification practices are frankly obscene. (I will at some point emit another rant about deidentification vs. anonymization and why the latter is an utter impossibility—but I’m tired and this rant is long enough already.)

Essential refuge? No refuge at all.

Summation

American libraries no longer guard patron privacy. We can do so again if we choose, but it will take quite a bit of work, not to mention considerably more self-restraint faced with the vastly-increased ease of digital surveillance than we have yet shown.

Until then, we need not to lie about it. Lying is a Facebook trick, as I described at length in a talk I gave yesterday for the local IT Leadership Conference (and promised to post here, so now I have). We should be above lying, we who prize truth.