Dr. Latanya Sweeney, for those who haven’t encountered her work before, is one of the titans of data privacy and reidentification research. Her work is the source of the oft-quoted factoid about nearly nine in ten Americans being uniquely identifiable with a combination of birth date, gender (binary assumed), and ZIP code. She’s deliberately reidentified politicians who blathered ignorantly about data privacy. She assembled evidence to call out search engine ad targeting for racism based on naming practices specific to African-American communities. Basically, she’s badass and a hero and I admire her exceedingly.
And I think there’s a tremendous LIS research agenda that grows out of her work (and the work of others, Arvind Narayanan not least) going begging: assessing the reidentifiability and risk profile of common sources of library patron data, and quantifying how possible it is to (re)associate a patron with evidence of (as ALA privacy guidance puts it) the subject(s) of their interest.
Such sources should probably include:
- Retained circulation records, both identified and in various flavors of deidentified. It may seem obvious that identified circ records pose a hazard to patrons, but it’s really, really not obvious to many librarians and most patrons.
- Proxy-server logs, identified and deidentified.
- CRM system records; these are identified (or what would be the point?) — the research questions have to do with the use of these records to reidentify patrons in other data sources.
- Chat reference logs, since they get retained for internal analysis and research a lot.
- Website and web-service logs from all sources (local logging, SaaS-tool-logging, logging-by-web-tracker-emphatically-including-Google-Analytics, logging-by-usability-tool).
- The usual-suspect computer-use logs and caches: browser caches, software-use caches, desktop-search caches, and the like.
- Single-sign-on data, especially but not only when it’s pseudonymous or limited to entitlement data. This isn’t strictly-speaking library data, but it’s getting baked into library authentication processes deeply enough that I consider it a valuable arena of LIS inquiry.
Example research questions include but are emphatically not limited to:
- How easy is it to associate a given patron with the subject(s) of their inquiry based on these data sources? Combinations of these data sources? Combinations of one or more of these data sources with common library-external data (e.g. for library learning analytics projects, GPA and major and demographic data and whatnot)? For academic libraries, combinations of one or more of these data sources with institution-external data (e.g. LinkedIn and alumni databases)? From the published LIS literature (because oh, are there ever skeletons in this closet and where is the Narayanan who will proof-of-concept them)?
- How do these data sources score on some standard measures of reidentifiability potential: k-anonymity and l-diversity and stuff like that?
- Feasibility of reidentification-by-behavior-trail. For example, if an attacker (probably a library insider) with access to library data comes in with knowledge of a specific person’s likely interests, can they pick that person out of (just as one example) proxy-server logs? What data-retention time horizons enable/prevent such reidentification? Put another way… how unique do behavior trails tend to be, and which patron populations are most at risk of behavior-trail reidentification? (Like, some freshman in a large-lecture course with a canned research project likely isn’t super-reidentifiable from deidentified proxy logs… but I suspect pretty strongly that I, a longtime staff member with fairly outré intellectual interests, would be.)
- What’s the potential of assessing group membership, particularly for groups targeted by law enforcement? Basically, if The Man drops by wanting to know who’s been searching up abortion or immigration or critical race theory or LGBTQ+ issues, can library data (alone or in combination with other data) rat patrons out to The Man as possible uterus possessors, or Dreamers, or people of color, or queer folks?
- What actually are libraries’ present data-retention and data-handling practices? Records schedules? Privacy policies? Governance processes? Data-handling processes during internal assessment as well as research for publication? We just don’t know enough about this, and the work I’m doing barely scratches the surface of the work that’s possible. (Pour one out for ARL SPEC Kits; this would actually be a good use for them.)
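To make the k-anonymity and l-diversity question above a little more concrete, here is a minimal sketch of how those two measures could be computed over a deidentified extract. Everything in it is an assumption made for illustration: the records are invented, the field names (`status`, `dept`, `zip3`, `subject`) are hypothetical, each row is assumed to stand for one patron, and "l-diversity" here means the simplest (distinct-values) variant. Real library data would need far more careful handling than this.

```python
from collections import defaultdict

# Entirely made-up "deidentified" records, one row per patron (an assumption).
# Quasi-identifiers: status, dept, zip3. Sensitive attribute: subject.
RECORDS = [
    {"status": "undergrad", "dept": "History", "zip3": "537", "subject": "immigration law"},
    {"status": "undergrad", "dept": "History", "zip3": "537", "subject": "immigration law"},
    {"status": "undergrad", "dept": "History", "zip3": "537", "subject": "medieval Europe"},
    {"status": "staff", "dept": "Library", "zip3": "535", "subject": "reproductive health"},
    {"status": "grad", "dept": "Biology", "zip3": "537", "subject": "genomics"},
    {"status": "grad", "dept": "Biology", "zip3": "537", "subject": "genomics"},
]

QUASI_IDENTIFIERS = ("status", "dept", "zip3")
SENSITIVE = "subject"


def equivalence_classes(records, quasi_identifiers):
    """Group rows that share the same values for every quasi-identifier."""
    classes = defaultdict(list)
    for row in records:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row)
    return dict(classes)


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest equivalence class (0 for an empty table)."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min((len(rows) for rows in classes.values()), default=0)


def l_diversity(records, quasi_identifiers, sensitive):
    """Distinct l-diversity: the fewest distinct sensitive values in any class."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(
        (len({row[sensitive] for row in rows}) for rows in classes.values()),
        default=0,
    )


if __name__ == "__main__":
    print("k =", k_anonymity(RECORDS, QUASI_IDENTIFIERS))
    print("l =", l_diversity(RECORDS, QUASI_IDENTIFIERS, SENSITIVE))
```

In this toy table the lone staff member sits in an equivalence class of one, so k is 1: anyone who knows that person's status, department, and ZIP prefix can pick out their row. The two Biology grad students share a class of two, but they borrowed on the same subject, so that class has only one distinct sensitive value and knowing someone is in it still reveals what they read. That is the gap l-diversity is meant to expose.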
Here’s the kicker. A lot of this work pretty much has to be done by working library practitioners, because they’re (quite properly, to be sure) the only folks who can actually get at library-internal data. I’ve wanted to do some of the above work for literal actual years, but as yet I haven’t located a library IT person willing to go in on it with me. There are additional wrinkles with library-external IT, too—I really want those single-sign-on reidentifiability studies to happen, but most librarians can’t unilaterally do them because (again) they can’t get at the institutional data without IT cooperation.
Like, I would actually prefer to be less paranoid than I am about library-patron data. It’d help my blood pressure, if nothing else. But without answers to the research questions I just posed, I… kind of have to assume the worst, based on the anecdata I have and on data-privacy solecisms evident in the LIS literature.
So. I’d be absolutely delighted to see a journal or conference or two (emphatically including the Code4Lib Journal) create some incentives for this work (and for doing it carefully and ethically, natch). Hey, editors and editorial boards, how about a themed issue? Even if it has to be guest-edited (and yes, I would absolutely serve in that capacity, pace my well-known unwillingness to donate labor to grossly exploitative commercial publishers). Hey, LIS conference organizers, can we get a track please? Hey, folks mentoring new academic librarians in need of research agendas, how about suggesting this one?
Let’s do this. It sure does need doing.