She can’t say that, can she?

I recently had the honor of giving a keynote for this year’s Library Publishing Forum. The Forum was online this year—they’re trading off in-person and online years, which I think is a wise way to proceed—so I was able to attend the whole thing, and I enjoyed myself hugely.

My slides and talk notes are on my Speakerdeck. If you’d rather watch the talk recording, the Library Publishing Coalition has it on their YouTube channel.

A quick-and-dirty review of Metadata for Digital Collections, 2nd ed.

I was incredibly grateful to the libraries’ interlibrary loan folks during pandemic lockdown for getting me a scan of chapter 10 (“Designing and Documenting a Metadata Scheme”) of the first edition of Steven J. Miller’s Metadata for Digital Collections. I can and do bricolage a lot of things in my syllabi, but there just is no substitute for this chapter, not anywhere.

That said, that was the only chapter of the book I used during the pandemic, because its first edition, while fine for its day, was getting (shall I say) rather long in the tooth, such that bricolage offered better options. (I am not the sort of instructor who rushes reflexively to pick out a textbook anyway. I am extremely opinionated about instructional design; many textbooks rub me the wrong way. Some, of course, are just garbage from the word go.) So I definitely perked up my ears (and asked for a review copy) when the second edition came out last year.

Verdict: Not perfect, but aside from one truly distressing and Seriously Not Okay topic omission (discussed below), pretty good. Good enough that yes, I’m going to require it despite its cost and the issues I have with it. The unity of voice, careful audience-aware prose, and exceptionally useful apparatus (glossary, crosswalks, plentiful sample records) are, often enough, an improvement over my usual bricolage. The irreplaceable Chapter 10 of the first edition is Chapter 12 of the second, and it’s as irreplaceable as ever.

That said, it ain’t perfect. Notable imperfections, one major, the rest minor:

  • There’s no discussion whatever—seriously, not a single word—about inclusion in metadata or controlled vocabularies, which these days is… well, straight-up, it’s unacceptable. (Cis white male authors gonna cis white male, I fear. Where were the reviewers and editor here? This shouldn’t have got past them.) Bricolage can fill this gap, however; there’s plenty of brilliant work out there.
  • I’m not sad METS is gone (was it in the first edition, even? I don’t recall), but I am a little sad there’s not a practical discussion of how systems handle file chunking and ordering, because they do, they just mostly do it with filenames or system-specific setups, not metadata. A brief mention of PCDM and the Oxford Common File Layout might not go amiss, even.
  • Calling the RDF serialization N-Triples “a subset of Turtle” is accurate while conveying absolutely no useful information whatever. What students need to know is that N-Triples is a commonly-available output serialization for stored RDF online—if you click on a random website link to a linked-data representation of something, you’re most likely getting N-Triples—and its syntax is very verbose and restrictive. (There’s a quick side-by-side sketch after this list.)
  • There are a few relatively minor technical gaffes, nothing dealbreaking. (The XML declaration is optional, y’all! An XML document without it should parse just fine.)
  • I think the coverage of Qualified Dublin Core is somewhat out of proportion to its actual importance in the field.
  • I think (based on a lot of classroom experience) that the book mis-sequences RDF and XML. The grotesquely clunky design of XML namespace declarations is a lot easier for students to assimilate once they’re used to Turtle’s cleaner, clearer @prefix declarations (the sketch after this list shows what I mean). That said, Miller is careful enough to sandbox chapters in this book that it’s quite possible to assign the RDF chapter out-of-sequence, and that is what I plan to do.
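To make the N-Triples and @prefix points concrete, here’s the same pair of statements in Turtle and then in N-Triples. (The example item and values are mine, not the book’s; the dct: namespace is plain Dublin Core Terms.)

    # Turtle: declare each prefix once, then reuse it
    @prefix dct: <http://purl.org/dc/terms/> .
    @prefix ex:  <http://example.org/items/> .

    ex:photo42 dct:title       "Main Street, 1921" ;
               dct:description "Photograph of Main Street looking north." .

    # N-Triples: no prefixes, every URI spelled out in full, one triple per line
    <http://example.org/items/photo42> <http://purl.org/dc/terms/title> "Main Street, 1921" .
    <http://example.org/items/photo42> <http://purl.org/dc/terms/description> "Photograph of Main Street looking north." .

Same data, wildly different ergonomics, which is exactly why I teach @prefix before I wade into XML namespace declarations.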

My most long-winded (sorry) issue with the book is its near-obsessive OCLC bootlicking. In this our year 2023, ContentDM is the worst available digital-collections demo option for a textbook; it’s an obsolete proprietary crap sandwich, never mind that it belongs to likely the most evil, odious pseudo-non-profit in libraryland—and believe you me, there are several contenders for that dubious crown. Of the remaining ContentDM customers I know, not one is happy with it; most are contemplating, or actually implementing, a migration off it. Omeka exists (and is far simpler for a harried instructor to let the class kick the tires on than ContentDM is—it’s a one-click install at my webhost). AtoM exists. Hell, DSpace exists. There are even a couple-three new contenders (one of which I actually plan to take a look at). The choice to advertise ContentDM is just embarrassing. So is the choice to feature OCLC’s little-known, barely-maintained, and largely-useless OAIster.

In short: mostly-well-done update, solid choice for a metadata classroom; needs a chapter on inclusion, a few minor fixes, and the total cessation of toadying to OCLC.

Tell ACRL to fix its assessment guidelines

(The below is a lightly-edited and augmented Mastodon thread. Please forgive prolixity and redundancy.)

HEY, anti-surveillance librarian-type folks! HEY, anti-surveillance higher-ed-type folks! ACRL is trying to ethicswash surveillance assessment again! You can read how here. You can comment, too! I gave them several earfuls. If you feel so moved, comment yourself.

First, a bit of context for the non-librarian-types. ACRL is the Association for College and Research Libraries. It is a division of the American Library Association focused specifically on academic-library stuff. It has put out guidelines on various matters previously; this is a thing it regularly does. So a set of guidelines on assessment is quite in-scope for them.

Here’s the problem, from where I’m sitting. In 2010, ACRL went all-in on “library learning analytics,” or as I like to call it, “surveillance assessment” (and steal that term all you want, no credit needed, I want it to become common discourse). Here’s where. The so-called “value agenda” has lots of moving parts, but one of them is (my paraphrase, but I have some receipts): “grab all the data on students and use it to the hilt to show how awesome libraries are.” By all means go read the report and see if you get a different impression; I’ll wait.

If any of you are on IRBs or otherwise know some things about research ethics or power, your alarms are already tripping, right? Researchers lab-ratting other people for their own benefit goes real nasty places real fast. And in practice so far, surveillance assessment of students is the only piece of the value agenda that has gotten any traction in libraries. The value-agenda folks have told on themselves time and time again. My favorite example is the table of contents to this 2018 report, which posits privacy as an obstacle to learning analytics. When. They. Tell. You. Who. They. Are. Believe. Them.

There are definitions of learning analytics—most definitions, actually—that posit a priori (and without proof, or indeed much evidence) that it is for the benefit of students. The Value Agenda for Libraries has never claimed this. It has only ever held promoting libraries as a goal. So one thing y’all can tell ACRL is that ACRL needs to make clear that self-interested, self- and/or library-promoting assessment violates the beneficence principle of ethical research and assessment. Beneficence is directly mentioned in the draft.

Okay. Let’s go through the draft now, starting with the Preamble.

“underlying expectation that assessment practitioners represent and promote the voices of all members of their campus communities,”

That is some vocational-awe nonsense right there. How about “inform”? Or “involve”? Or even “protect”?

“Apply ethics to guide all assessment activities.”

And here we have the Empty Genuflection to Ethics, a super-common “feature” of value-agenda discourse. Tell ACRL it’s not good enough. First they need to specify what kinds of ethics are in play. Possibilities include: human-subjects research ethics, library-specific ethics (hold that thought), ethics of care, feminist ethics. Then they need to specify how to apply ethics: specifically, that unethical modes of assessment must not be allowed.

The big library-specific ethics issue—probably obvious to the librarians in the room but let me say it so we’re all on the same page—is privacy. ALA Code of Ethics, Article 3, “We protect each library user’s right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.” Please note that it’s confidentiality and privacy. Patrons get both, not one or the other!

The word “privacy” appears zero times in this document. Zero. If you do nothing else, please inform ACRL that the absence of privacy from this document is unacceptable. Cite the ALA Code if you like. This is the big ethicswash in action, y’all! Don’t let ’em get away with it! Tell ACRL that you expect this document to state that any assessment practice that violates patron privacy is unethical. The start of the recently-added Article 9 is good too. “We affirm the inherent dignity and rights of every person.” Surveillance assessment violates both.

Okay, on to proficiencies. The Engaging Ethically section is a pile of ethicswashy nonsense. Pin them down, please. Tell them that wishywashy verbs like “affirm” and “maintain and enhance knowledge and skills” are not gonna cut it. Tell them you need to see verbs associated with actual assessment practices, like “collect, store, analyze, minimize, and secure” for starters. Notice also the absence of anything around data security or data governance anywhere in this document. Unacceptable. Tell them so.

“… advance racial and social justice in our libraries, communities, and profession through awareness, advocacy, education, collaboration, and allocation of assessment resources.”

But not, apparently, by actually doing assessment ethically. Unacceptable. Tell them so.

“Critically engage with questions around current and past assessment practices to ensure ethical practices are being followed and equitable outcomes are being produced.”

No. Ensure ethical assessment practices, full stop. “Critical engagement” is ludicrously insufficient. Do assessment ethically or not at all.

On to the building-knowledge section.

“Paired with the knowledge of social justice, equity, diversity, inclusion, and accessibility issues, practitioners are challenged to consider how assessment activities can be used to dismantle inequitable systems and structures.”

Consider? Just consider? No. We are challenged to dismantle, full stop. Also tell ACRL that practitioners must know about the history and present of inequitable data practices and uses, especially in libraries. None of this techbro-style “who could have known surveillance would be bad for the surveilled?” excuse-making.

On to part 3, on design, collection, and analysis.

There is, again, no mention of privacy here. This is exactly where privacy needs to be spelled out in operationalizable terms. Tell ACRL to do that. There is also no mention of data minimization, data security, or the appropriate boundaries of patron-data sharing here. Unacceptable. Tell ACRL to fix it.

“Balance the need for comprehensive data with being able to critically evaluate what data are needed to make informed decisions.”

Tell ACRL to add that no data gathered or handled in an unethical fashion can be considered “needed” for any decision ever, and that no data that violates the information privacy of library patrons can be considered “needed” for any decision ever.

“Understand and practice appropriate data management and handling techniques.”

This is just so milquetoast. Tell ACRL to be specific! DATA SECURITY. DATA MINIMIZATION. DEIDENTIFICATION. PRIVACY. GOVERNANCE. As is, this is unacceptable. Tell ACRL to fix it.

Section 4, reflecting and making meaning.

“Reflect upon whether the techniques used to analyze and present data are appropriate and effective within a particular context.”

How about reflecting on ethics? And how about adding “collect” to “analyze and present” here? Reflecting on ethics is also not enough. Reflections that lead to the conclusion that a technique is unethical must mean the end of the project, retractions of public communications based on the data if needed, a public incident report, and an apology to affected patrons.

“Determine whether assessment data are sufficient before making decisions or taking actions based on the results.”

This is dangerous. It directly advocates endless data grabs. Tell ACRL to put limits on it. “Determine whether collecting additional data adds risk to patrons, including to patron privacy.”

In the interests of sparing my hands, I’m skipping section 5. I want to get to section 6, the leadership section. There is nothing in it about ensuring ethical assessment. Not a single word. This is absolutely unacceptable. Tell ACRL to fix it. The ethical buck stops with leaders.

“Prioritize social justice, equity, diversity, inclusion, and accessibility goals wherever possible and motivate colleagues and team members to conduct equity-centered assessment.”

Prioritize privacy and ethics!

Self-check:

“Have I considered the values and ethics that guide my work and align to my institution, overall and in specific assessment projects?”

Again, considering is not enough. You enact values and ethics or you don’t. Tell ACRL so, and tell them that not enacting ethics is unacceptable.

Bibliography: contains not one single piece critical of the ethics of surveillance assessment. ACRL done told on itself again. Call them on it! The critical pieces absolutely exist! I dumped half my recent manuscript bibliography in ACRL’s Google form.

So yeah. There’s my take. Do with it as you please!

Shaarli: a good migration target from Pinboard

To make a long story short, Pinboard owner Maciej Ceglowski milkshake-ducked himself with bizarre and out-of-nowhere hair-splitting regarding whether JK Rowling is a TERF. I have too many trans and non-binary people in my life to give my money willingly to someone like that. So I was suddenly, unexpectedly, and unhappily in the market for a new linkspam tool.

A quick recap of my user story: I keep (not to say “hoard”) links because I build readings for course syllabi out of them. I also share tag-based URLs with students and colleagues when inspiration strikes. New links get shunted to my Mastodon account, since apparently some folks find that useful. I need my linkspam tool to handle a LOT of links, give me a reasonably fast link-add mechanism, produce an RSS feed, let me combine tags (ideally with search) into a filtered linklist and share the resulting URL with others, and let me build such URLs from memory based on my knowledge of my own tag use.

I tried LinkAce first. I can say in its favor that it can be run on CPanel-enabled shared hosting, though I wouldn’t call it exactly easy to install. (I did manage it, and I’m a terrible awful useless sysadmin, so yeah.) I can say little else in its favor—it’s painfully slow, its link-add page is almost as infuriating as Raindrop’s (the tag lookup is just deadly bad), and it doesn’t have combined-tag filters or intelligible URLs. I have hopes for it, but as-is, I can’t make it work for what I need. To add insult to injury, its HTML export (in the de facto exchange format for link tools) wouldn’t work for me.

I was scared off Shaarli at first because of the documentation’s exhaustive list of server prerequisites and incantations. I shouldn’t have been! It installed quite easily on my webhost! Let’s say you want your Shaarli to live on the web at linkspam.example.com, and in a folder named “linkspam” on your webhost.

  1. In CPanel, make the linkspam.example.com subdomain, and point it to your linkspam folder. (Currently this is done via CPanel’s “Subdomains” menu item, but apparently this functionality is being moved to “Domains” shortly.)
  2. Go to CPanel’s “Domains” menu item and toggle “Force HTTPS Redirect” on for linkspam.example.com. (Do it for all your other domains and subdomains while you’re at it. I’d missed a couple of mine!)
  3. Download the .zip file for the latest Shaarli release.
  4. Use CPanel’s File Manager (or SFTP, if you’re so inclined) to upload the .zip file to the folder one level up from your linkspam folder. (Trust me, okay?)
  5. In File Manager, delete your empty linkspam folder. (No, really, trust me!)
  6. Choose the Terminal menu item in CPanel. You’re in your home directory; if you need to, cd to the folder you put the .zip file in. Now type unzip sh and hit your tab key, which should autocomplete the filename for the Shaarli .zip file. Hit return.
  7. Go back to File Manager. Reload it. Find a folder named Shaarli, and rename it to linkspam. (Now you see where I was going with this! You could also do this in the Terminal with mv Shaarli linkspam.)
  8. Go to linkspam.example.com in your browser, and finish setup.

And that should be it. (On my to-do list: setting up an automatic backup for Shaarli’s “data” directory. Pretty sure I can do this directly in CPanel, via cron if necessary.)
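For what it’s worth, the cron version I have in mind is a one-liner along these lines. (The ~/linkspam and ~/backups paths are just my assumptions from the walkthrough above, and the schedule is arbitrary; I haven’t actually set this up yet, so treat it as a sketch, not a recipe.)

    # Nightly at 3:15 am: snapshot Shaarli's data directory into a dated tarball.
    # In crontab syntax, % has to be escaped as \%.
    15 3 * * * tar -czf "$HOME/backups/shaarli-data-$(date +\%F).tar.gz" -C "$HOME/linkspam" data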

I don’t have time for a full Shaarli report card, but here are a few things I’ve noticed in the couple of hours I’ve had it running:

  • Text search can be combined with tag filtering, which Raindrop can do but Pinboard can’t. Nifty, though it’d be more elegant with just one search bar and a parseable text trigger (probably #) for tags, as Raindrop does.
  • I don’t love Shaarli’s URLs—everything is query parameters—but I can live with them. They definitely copy-and-paste cleanly, unlike Raindrop’s or LinkAce’s, and the components are memorizable.
  • Bookmarklet is a tiny bit slow to load, but so was Pinboard’s sometimes; I can live with it. Bookmark entry is a breeze; Shaarli does not do LinkAce’s horrible horrible real-time(-ish) tag lookup.
  • OH MY GOSH, SEARCH OPERATORS! Phrase searching, minus-ing, wildcards! I will enjoy getting to know these.
  • The Wayback Machine integration is clever and useful, and will save me some time during syllabus construction. (Sometimes irreplaceable links 404.)
  • Shaarli could really use some CSS love. Maybe if I locate some spare time. One thing I would immediately do is get rid of the little tag icon next to tags. It’s purest visual clutter, but a simple display:none will take care of it (see the one-liner after this list).
  • What the hell is that QR code thing doing there, and why can’t I get rid of it?! Minor nit, it’s not all that obtrusive, but ugh.
  • I haven’t checked into its add-ons yet, but there seems to be a flourishing community, so I will.
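Since I griped about it above: the tag-icon fix really is about one declaration’s worth of CSS, though the selector below is my guess at what Shaarli’s default theme uses for that icon, so inspect the element before trusting me.

    /* Hide the little tag icon shown next to each tag (selector is a guess; verify in dev tools) */
    .fa-tag {
      display: none;
    }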

So yeah, Shaarli is solid and useful and does what I need it to, and is not profiting any milkshake ducks. Fancy and pretty I don’t actually need. As for my Pinboard, I’m leaving it up for a bit until I’ve weaned my various syllabi and course assignments off it, and then it will go away.

Half a filk

It took me a lot longer to get to Hamilton than most. I dearly want to send Michael Gorman a ticket, after his racist diss of hip-hop that probably only I remember by now. Ah, well.

I made half a filk after watching the Disney production a while back. The rest is still to be written, I think; I hope it has a better ending for librarianship than for Hamilton. Grateful to the Genius website for making the lyrics available (and copiously annotated).

[Salo:]
There’s nothing like tenure-driven research
Data in the ILS meets data from the teachers
There’s value in the air, you can smell it
And a researcher’s by himself. I’ll let him tell it

[Researcher:]
I hadn’t slept in a week
I was weak, I was awake
You never seen a librarian
More in need of a break
Longing for significance
Missing ROI
That’s when Ms. Value Agenda walked into my life. She said:

[VAL:]
I know you are a professional
I’m so sorry to bother you at home
But I don’t know where to go, and I came here all alone…

[Researcher:]
She said:

[VAL:]
My admin’s doin’ me wrong
Puntin’ me, huntin’ me, defundin’ me…
Suddenly budget’s up and gone
I don’t have the means to go on

[Researcher:]
So I offered her a meal, I offered to break her Big Deal, she said

[VAL:]
You’re too kind, sir

[Researcher:]
I gave her some spreadsheets that I had socked away
She worked a block away, she said:

[VAL:]
This one’s mine, sir

[Researcher:]
Then I said, “Well, I should head back home,”
She turned foxy, she led me to EZProxy
Opened her data boxie and said:

[VAL:]
Stay?

[Researcher:]
Hey…

[VAL:]
Hey…

[Researcher:]
That’s when I began to pray:
ALA, show me how to
Say no to this
I don’t know how to
Say no to this
But my God, the data’s so fresh
And the journal’s saying, “Hell, yes.”

[VAL:]
Whoa…

[Researcher:]
No, show me how to

[Researcher/Ensemble:]
Say no to this

[Researcher:]
I don’t know how to

[Researcher/Ensemble:]
Say no to this

[Researcher:]
In my mind, I’m tryin’ to go

[Ensemble:]
Go! Go! Go!

[Researcher:]
Then the data lake’s online, and I don’t say…

[Ensemble:]
No! No!
Say no to this!
No! No!
Say no to this!
No! No!
Say no to this!
No! No!
Say no to this!

[Researcher:]
I wish I could say that was the last time
I said that last time. It became a pastime…

Off it goes…

Well, the article I started last spring is finished and off to a journal. I expect to have some trouble placing this one, because I’m just stubborn enough to send it to outlets whose practices it calls into question. (They are absolutely appropriate outlets for the piece, I hasten to say—I’m not wasting anybody’s time, that would be wrong of me.)

So I’ll track the rejections, and post ’em here once it’s finally accepted somewhere. Which it will be, I’m confident. I did good and useful work on this one, if I do say so myself.

Some tidbits about data handling in library learning analytics

So I finished building and coding up my dataset of library learning-analytics articles! That was a lot of work. I also have a data dictionary, and a methods section in the article draft! Yesterday I got to start writing queries against the database and writing them up in the article draft.

Want some tidbits about the 62 research projects I ended up studying (46 of them American)? Of course you do. Have some:

  • 35 of 62 projects, 27 of those 35 American, made no attempt whatever to deidentify data before analysis. Ahoy ahoy potential data leaks and insider threat!
  • 11 projects, 8 American, used data that revealed the subject of a patron’s inquiry. That’s a pretty bright-line no-no in libraries, folks.
  • Only 11 projects notified students about the specific research that would be taking place using their data. One more claimed that students were notified because the campus ID card terms of service told them research (unspecified) would be happening.
  • Actual informed consent? Five. Five projects sought it. Out of 62.
  • Wondering where ethics review was in all this? Yeah, me too. Of the 46 American projects, 11 passed IRB review, 4 were declared exempt, and a big fat nothing for the remainder. One of the 16 non-American projects received ethics review.
  • Sensitive data used in these projects included: socioeconomic status data or proxies thereof (13 projects), high-school performance data (GPAs and SAT/ACT-or-analogue scores, 13 projects), location data (7 projects), first-generation student status (6 projects), national origin or citizenship data (4 projects), military/veteran status (3 projects), and disability status (1 project).

I’ve got more; I wrote plenty of ANDed WHERE clauses yesterday (SQL is so much fun!), and more may occur to me as I continue the writing-up. But the above certainly gives you the flavor. It is not a good flavor.
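For flavor, the queries look roughly like this. (To be clear, the table and column names below are invented for illustration; they’re not my actual data dictionary.)

    -- How many projects neither deidentified their data nor sought informed consent?
    SELECT COUNT(*)
      FROM projects
     WHERE deidentified = 'no'
       AND informed_consent = 'no';

    -- American projects that used data revealing the subject of a patron's inquiry, by year
    SELECT pub_year, COUNT(*) AS n
      FROM projects
     WHERE country = 'US'
       AND reveals_inquiry_subject = 'yes'
     GROUP BY pub_year
     ORDER BY pub_year;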

Please miss me with all the gaping loopholes in the rules governing which projects must receive ethics review. I know. That’s part of the problem! I plan to write about it at length in the other paper! (I also want to acknowledge David Fiander for giving me lots of useful intel on Canadian ethics-review loopholes yesterday on Mastodon. Appreciate it, David, and I’ll also acknowledge your help in one or both papers.) It may seem convenient to dodge all this red tape, but in my head what it really means is that LIS is letting its researchers show their ethics underwear all over the place, unguided and (crucially) unprotected. It’s not the Value Agenda for Libraries pushers whose careers will be tarnished when (and it’ll be when if I have anything to say about it) retractions and expressions of concern start happening; I expect they’ll claim it’s on researchers to Do Ethics Right, none of their concern. It’s pretty much academic librarians doing what VAL pushers told them was okay—not just okay, vitally important—who will be hung out to dry.

Not sure how the VAL pushers sleep at night, honestly—if my analysis holds water, which I think it does or I wouldn’t still be working on it, they’ve royally screwed students and librarians—but I suppose that’s not my problem.

Anyway, a lot of the discussion for this piece will be the first (as far as I know) attempt at examining real-world library learning-analytics practices in light of what we know from Data Doubles and similar research (which there’s rather more of now! yay!) about student preferences, the top two of which have repeatedly been shown to be notification and the chance to consent (or not). There’s an ethics-of-care argument there that I’m happy to make: if we care about students as much as we claim to, ignoring or overriding their stated preferences, especially for a research agenda that does not directly benefit them (hello beneficence! the Value Agenda for Libraries has none of you!), cannot be ethically acceptable.

I’ll publish the data, too. Zotero exports for both eligible and ineligible project citations, SQLite database, CSV database exports (though I need to think about building useful views for later-researcher convenience), basically the lot. CC0 on all of it, not that there’s much if any copyright in it to claim. You want to play in my data playground? Go for it.

It ain’t necessarily so

The publication Book Riot has been doing incredibly necessary journalism very skillfully around the latest rash of attempted and successful censorship of library materials and librarian voices. If you’re in the States and it’s not in your daily round of book news, whyever not?

They put out a great piece today on the mess in the Oklahoma City library system regarding abortion information. It’s really, really good, and probably headed for my fall intro syllabus.

But one sentence gave me pause: “[S]ome information privacy practices in public libraries emerged following the Patriot Act, which is why, for example, records of materials checked out by individuals are not saved and why it is shared computers are wiped of their histories between sessions.”

Oh. Oh, no. Oh, dear. Neither of these assertions is universally true, and the assertion about circulation records appears to be becoming steadily less true as CRM systems and assessment/analytics take firm hold. Usually the shared-computers thing comes down to available IT staff and budget rather than any sort of intentional privacy or retention plan, but a whole lot of libraries are strapped for IT talent and budget, so, well, don’t count on the wiping happening.

As for circulation records, I have direct proof of one academic-library system not deleting them! But long story short, libraries (both public and academic) that retain identified circulation records past materials return typically do so for one or more of the following reasons:

  • Patron convenience, commoner in public libraries than academic, but the academic-library consortium that serves my workplace explicitly names this as its retention rationale in its records schedule
  • “Assessment” and/or “analytics,” which is where CRM systems come into it
  • Academic-style research, which overlaps with assessment/analytics in tangled ways, as with “value of academic libraries” research
  • ILS settings that aren’t twiddled in favor of privacy
  • Isolated edge cases, such as special collections (where defacement and theft of materials by patrons are extra-serious issues)

In some libraries, it’s more than physical-materials circulation records—I don’t want anyone thinking “well, I never check out books, so I’m safe!” For academic-library-purchased ebooks and ejournals, identified or reidentifiable traces of patron information use can be left in proxy-server logs, which (I hear from e-resources librarians of my acquaintance) can definitely stick around a lot longer than they should. There’s also the whole question of what data the vendors are keeping, but that’s tangential to the Book Riot question, so I’ll let it go for this post, noting only that I wrote a big long thing about it that people can read, and they should also pay attention to Sarah Lamdan on the subject.

In response to a media query last Friday that went “Should people be concerned about their data privacy when it comes to searching for abortion-related resources?” I wrote the following paragraph about libraries:

If we are talking about libraries: yes, and as a librarian who strongly values information privacy I hate this answer, but it is the only honest answer I have. Too many libraries are retaining identified or reidentifiable search logs and identified circulation records far longer than they should. Too many vendors who sell online content access to libraries for use by patrons are using the same trackers and surveillance adtech as the rest of the web. I’m fighting my own profession to make it live up to its stated privacy ethics with everything I have in me—but it’s an uphill battle. Folks need to be aware that libraries, whatever our rhetoric as librarians, are not necessarily keeping them safe.

But I feel a bit of a filk coming on, so…

It ain’t necessarily so
It ain’t necessarily so
The privacy mottos
In library grottoes
It ain’t necessarily so

And I’ll leave it at that, before I get myself in trouble yet again calling out specific people, libraries, and practices.

I broke my Twitter leave of absence to ping Book Riot’s Twitter about this. I repeat here what I said there: there’s lots of skeevy stuff happening, and I’m as good an option as most to talk knowledgeably about it. Give me a shout, Book Riot, if you would.

Who’d like to be the Dr. Latanya Sweeney of library patron data?

Dr. Latanya Sweeney, for those who haven’t encountered her work before, is one of the titans of data privacy and reidentification research. Her work is the source of the oft-quoted factoid about nearly nine in ten Americans being uniquely identifiable with a combination of birth date, gender (binary assumed), and ZIP code. She’s deliberately reidentified politicians who blathered ignorantly about data privacy. She assembled evidence to call out search engine ad targeting for racism based on naming practices specific to African-American communities. Basically, she’s badass and a hero and I admire her exceedingly.

And I think there’s a tremendous LIS research agenda that grows out of her work (and the work of others, Arvind Narayanan not least) going begging: assessing the reidentifiability and risk profile of common sources of library patron data, and quantifying how possible it is to (re)associate a patron with evidence of (as ALA privacy guidance puts it) the subject(s) of their interest.

Such sources should probably include:

  • Retained circulation records, identified and various flavors of de-. It may seem obvious that identified circ records pose a hazard to patrons, but it’s really, really not obvious to many librarians and most patrons.
  • Proxy-server logs, identified and de-.
  • CRM system records; these are identified (or what would be the point?) — the research questions have to do with the use of these records to reidentify patrons in other data sources.
  • Chat reference logs, since they get retained for internal analysis and research a lot.
  • Website and web-service logs from all sources (local logging, SaaS-tool logging, logging by web trackers, emphatically including Google Analytics, and logging by usability tools).
  • The usual-suspect computer-use logs and caches: browser caches, software-use caches, desktop-search caches, and the like.
  • Single-sign-on data, especially but not only when it’s pseudonymous or limited to entitlement data. This isn’t strictly-speaking library data, but it’s getting baked into library authentication processes deeply enough that I consider it a valuable arena of LIS inquiry.

Example research questions include but are emphatically not limited to:

  • How easy is it to associate a given patron with the subject(s) of their inquiry based on these data sources? Combinations of these data sources? Combinations of one or more of these data sources with common library-external data (e.g. for library learning analytics projects, GPA and major and demographic data and whatnot)? For academic libraries, combinations of one or more of these data sources with institution-external data (e.g. LinkedIn and alumni databases)? From the published LIS literature (because oh, are there ever skeletons in this closet and where is the Narayanan who will proof-of-concept them)?
  • Quantification of some standard measures of reidentifiability potential — k-anonymity and l-diversity and stuff like that. (There’s a quick SQL sketch of the k-anonymity flavor after this list.)
  • Feasibility of reidentification-by-behavior-trail. For example, if an attacker (probably a library insider) with access to library data comes in with knowledge of a specific person’s likely interests, can they pick that person out of (just as one example) proxy-server logs? What data-retention time horizons enable/prevent such reidentification? Put another way… how unique do behavior trails tend to be, and which patron populations are most at risk of behavior-trail reidentification? (Like, some freshman in a large-lecture course with a canned research project likely isn’t super-reidentifiable from deidentified proxy logs… but I suspect pretty strongly that I, a longtime staff member with fairly outré intellectual interests, would be.)
  • What’s the potential of assessing group membership, particularly for groups targeted by law enforcement? Basically, if The Man drops by wanting to know who’s been searching up abortion or immigration or critical race theory or LGBTQ+ issues, can library data (alone or in combination with other data) rat patrons out to The Man as possible uterus possessors, or Dreamers, or people of color, or queer folks?
  • What actually are libraries’ present data-retention and data-handling practices? Records schedules? Privacy policies? Governance processes? Data-handling processes during internal assessment as well as research for publication? We just don’t know enough about this, and the work I’m doing barely scratches the surface of the work that’s possible. (Pour one out for ARL SPEC Kits; this would actually be a good use for them.)

Here’s the kicker. A lot of this work pretty much has to be done by working library practitioners, because they’re (quite properly, to be sure) the only folks who can actually get at library-internal data. I’ve wanted to do some of the above work for literal actual years, but as yet I haven’t located a library IT person willing to go in on it with me. There are additional wrinkles with library-external IT, too—I really want those single-sign-on reidentifiability studies to happen, but most librarians can’t unilaterally do them because (again) they can’t get at the institutional data without IT cooperation.

Like, I would actually prefer to be less paranoid than I am about library-patron data. It’d help my blood pressure, if nothing else. But without answers to the research questions I just posed, I… kind of have to assume the worst, based on the anecdata I have and on data-privacy solecisms evident in the LIS literature.

So. I’d be absolutely delighted to see a journal or conference or two—emphatically including code4lib journal—create some incentives for this work (and for doing it carefully and ethically, natch). Hey, editors and editorial boards, how about a themed issue? Even if it has to be guest-edited (and yes, I would absolutely serve in that capacity, pace my well-known unwillingness to donate labor to grossly exploitative commercial publishers). Hey, LIS conference organizers, can we get a track please? Hey, folks mentoring new academic librarians in need of research agendas, how about suggesting this one?

Let’s do this. It sure does need doing.