OSINTing a library’s privacy practices

When I’ve taught ILSes or databases in the iSchool, I’ve always been pretty clear about How Things Work vis-à-vis circulation privacy. A circulation transaction is a row in a bridge/junction/associative table (dear database community: standardize your jargon, please, love, me), connecting a patron with something they’re checking out. Circulation assessment amounts to a checkout-count column in the item table, with one added to an item’s count each time it is checked out. (More or less. In-house use is often also tabulated, to the extent possible and with no attempt to record who actually plucked the item off the shelf.) When the item comes back in decent condition and fee-less, the row in the bridge table gets deleted, while the count is left alone. Simple. Data minimized. As private as the transaction can reasonably be while still allowing the library to keep tabs on its materials and measure what’s circulating and what isn’t.
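
To make that model concrete, here is a minimal sketch in Python with SQLite. The table and column names (patron, item, checkout, checkout_count) are made up for illustration and are not taken from any real ILS; the point is only that a return deletes the patron-item link while the anonymous count survives.

```python
import sqlite3

def open_library():
    """Create an in-memory database with the three tables described above."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE patron (patron_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE item (
            item_id INTEGER PRIMARY KEY,
            title TEXT,
            checkout_count INTEGER NOT NULL DEFAULT 0
        );
        -- The bridge table: one row per item currently checked out.
        CREATE TABLE checkout (
            patron_id INTEGER NOT NULL REFERENCES patron(patron_id),
            item_id INTEGER NOT NULL UNIQUE REFERENCES item(item_id),
            due_date TEXT NOT NULL
        );
    """)
    return db

def check_out(db, patron_id, item_id, due_date):
    """Link patron to item and add one to the item's anonymous count."""
    with db:  # both statements commit together or not at all
        db.execute("INSERT INTO checkout VALUES (?, ?, ?)",
                   (patron_id, item_id, due_date))
        db.execute("UPDATE item SET checkout_count = checkout_count + 1 "
                   "WHERE item_id = ?", (item_id,))

def check_in(db, item_id):
    """Delete the patron-item link; the count is deliberately left alone."""
    with db:
        db.execute("DELETE FROM checkout WHERE item_id = ?", (item_id,))

# Hypothetical patron and item, purely for demonstration.
db = open_library()
db.execute("INSERT INTO patron VALUES (1, 'Example Patron')")
db.execute("INSERT INTO item (item_id, title) VALUES (10, 'Example Title')")
check_out(db, 1, 10, "2021-05-01")
check_in(db, 10)

links = db.execute("SELECT COUNT(*) FROM checkout").fetchone()[0]
count = db.execute(
    "SELECT checkout_count FROM item WHERE item_id = 10").fetchone()[0]
print(links, count)
```

After the return, nothing in the database says who borrowed the item, yet the item still shows one circulation.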

I’m more than a little ashamed at having taught from such a state of naïveté, honestly. The evidence that the General Library System (GLS) at my institution was not handling circulation in this way was right there if I’d cared to go looking for it. What I’ll do now—again, with thanks to my spring 2021 LIS 510 students—is give you some idea where to go looking if you’re curious about a library’s stated practices.

OSINT stands for “open source intelligence” and is a bit of information-security jargon. It refers to researching a target without actually breaking into or even interacting with the target’s systems or people. It includes (but is not limited to) clever search-engine use, fossicking around on social media, and taking advantage of web-wide scanning services like Shodan. I teach a few extremely basic OSINT techniques (mostly Google dorking) in LIS 510.

Privacy policy

I mean, yes, this is the obvious place to start! Where this exists, it should be linked from the library’s home page, from the policies page if there is one, or from the About the Library page. My favorite library privacy policy is San Francisco Public Library’s. It’s strong, non-obfuscatory, and written with admirable precision and clarity.

In my institution’s libraries’ case, there is no overarching privacy policy. None whatever. Not at the UW-Madison level, not at the Council of UW Libraries level. So. There that is.

The sole privacy policy I know of for any library unit belongs to the UW Digital Collections Center and applies only to the digitized collections, finding aids, institutional repository, and other odds and ends under their purview. It, too, is short, clear, and reasonable, though a tad bit loophole-prone. I should mention that although I worked for UWDCC for some years, I had nothing to do with that policy. I’m not praising it because I helped write or institute it—I didn’t!

Applicable law

In the US, most states have laws pertaining to the confidentiality and/or privacy of library circulation records. Depending on the state, these laws may apply to public libraries only or to both public and academic libraries; Wisconsin’s applies to any library accepting public (state) funding. Unfortunately, most such laws are too old to cover electronic-resource transactions. ALA has a helpful page listing these laws.

Wisconsin’s law is actually not great. If a library has “surveillance devices” anywhere, the law as I read it (I am not a lawyer and this is not legal advice!) allows law enforcement to stroll in and ask for pretty much whatever they want from them whenever they want without a court order; the only limiter is that the alleged crime has to have happened on (physical?) library territory. I really, really don’t love that! Wisconsin libraries, please don’t install surveillance devices!

As for circulation records, the law prohibits disclosure under most circumstances (there’s some stuff about minors that mostly won’t be relevant to an academic library), but there’s a loophole I could drive a sixteen-wheeler through: “persons acting within the scope of their duties in the administration of the library or library system” may disclose patron identities as they see fit. So if GLS employees take it into their heads to dump identified circulation records willy-nilly into a campus data warehouse for use by all and sundry, my reading of Wisconsin law says that I, as a patron viscerally infuriated by this notion, wouldn’t have a legal leg to stand on in opposing it.

I really, really don’t love this law. Go find yours and see if you love it any better.

Records-management schedule

Records managers (more recently, “information/data governance” folks) are the people who ensure that documents, forms, emails, and all the other bureaucratic detritus that organizations deal with are kept as long as legally and business-process-ly necessary and no longer. (Outdated records can be a legal liability; they also take up space unnecessarily. You want them gone.)

The first step in getting records under control is scheduling them—that is, figuring out how long each class of records should be kept. These decisions are documented in a records schedule, logically enough, and that’s a good search term to know.
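
As a rough illustration of putting that search term to work, the sketch below builds a site-restricted search query of the Google-dorking kind. The domain library.example.edu is a placeholder, not a real library site, and the helper names are invented for this example; `site:` and `filetype:` are ordinary search operators.

```python
from urllib.parse import quote_plus

def dork(site, phrase, filetype=None):
    """Build a query limited to one site, an exact phrase, and optionally a file type."""
    parts = [f"site:{site}", f'"{phrase}"']
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)

def search_url(query):
    """Turn the query into a URL you could paste into a browser."""
    return "https://www.google.com/search?q=" + quote_plus(query)

# Placeholder domain; substitute the library or campus site you are curious about.
query = dork("library.example.edu", "records schedule", filetype="pdf")
print(query)
print(search_url(query))
```

Dropping the file-type restriction, or swapping the phrase for “retention schedule,” are obvious variations to try if the first query comes up empty.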

In my case, the records schedules that industrious LIS 510 students found were absolutely the smoking gun. Here they are for your perusal, but I’ll quote the two immediately-relevant ones in full. First, UWLIB122:

UWLIB122 Circulation Records

Records documenting the borrowing of circulating library materials by qualified patrons. This series may include but is not limited to: bibliographic information of item, the name and identification of the borrower; the titles of materials borrowed; the length of time borrowed; the due date; overdue and fine payment notations; and related documentation and correspondence.

Circulation records are kept for the duration of a patron’s status as an authorized borrower as a courtesy to patrons interested in their borrowing history. Circulation records are handled by the Alma integrated library system. Alma migrates circulation records into its analytics database. These records are scheduled under UWLIB147 for longer retention.

Like, when were the libraries going to tell me, in some more prominent way than this bit of buried (though necessary) bureaucratese, that they were keeping my circ records “as a courtesy” to me? As a trained and well-ethicked-up librarian myself, I find that this absolutely violates my privacy expectations, detailed at the beginning of this post! I would love to write a piece with Helen Nissenbaum about that! (I would love to write anything with Helen Nissenbaum, let’s be clear here.) If I don’t consider this record retention a courtesy (and I very much don’t), what exactly is my route to getting the records deleted? There’s nothing I can find on the library website about it!

Oh. Right. Yeah. No privacy policy. Hm.

Moving on to the abovementioned UWLIB147:

UWLIB147 New Analytics Records

Records of both incomplete and completed circulation and interlending actions created by the library automated system (Alma) that are migrated into a separate database for purpose of statistical analysis. The records include patron data as well as bibliographic. Alma Analytics records are kept 10 years to provide better collection security for special collections materials.

New schedule added due to new functionality of the new library automated system (Alma)

I will have more to say about Alma Analytics, but I’m still doing my homework on it. Suffice to say that the idea that identified circulation records are sitting around for ten years for any reason makes my head explode. That they’re doing it in service of faddish, dubiously ethical, unproven (and where tested, largely ineffectual) Big Data practices makes my head vaporize with fury.

This is wrong and it should not be happening. Have we forgotten Henry Melnek? Have we forgotten the Connecticut Four? What is wrong with us, librarians?

Anyway. So. Let’s examine the records I released of my own borrowing behavior under these schedules. As I noted in the README, there are circ records on me all the way back to 2002. I can’t make that work under UWLIB122, because in 2005 I graduated from the UW-Madison iSchool (then SLIS) and moved away to take a job at George Mason University. From then to March 2007, I had no ties to UW-Madison except as alumna. As I read UWLIB122, my circ records should have been deleted in 2005. They weren’t. They still haven’t been. So the General Library System sure looks to me to be in violation of their own records schedule. Maybe the schedule was different in the pre-Alma Voyager days, but I don’t care. Once UWLIB122 became the working schedule, deletions should have happened in accordance with it.
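
The check described above can be sketched in a few lines of Python. This is one reading of UWLIB122, not the libraries’ actual procedure, and every date below is a made-up placeholder that only loosely follows the timeline in the previous paragraph; none comes from the released records.

```python
from datetime import date

def overdue_for_deletion(loan_dates, status_periods, today):
    """Return loans made during a borrower-status period that has already ended.

    status_periods is a list of (start, end) dates; end=None means still active.
    Assumed reading of the schedule: once a period ends, its loans should be gone.
    """
    flagged = []
    for loan in loan_dates:
        for start, end in status_periods:
            in_period = start <= loan and (end is None or loan <= end)
            if in_period:
                if end is not None and end < today:
                    flagged.append(loan)
                break
    return flagged

# Placeholder dates: one borrower period that ended, one still open.
periods = [(date(2002, 9, 1), date(2005, 5, 31)), (date(2007, 3, 1), None)]
loans = [date(2003, 2, 14), date(2004, 11, 3), date(2010, 6, 20)]

flagged = overdue_for_deletion(loans, periods, today=date(2021, 6, 1))
for loan in flagged:
    print("should already be deleted:", loan.isoformat())
```

Under this reading, any loan from the first period that still exists in the system is a schedule violation, regardless of what happened in the second period.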

I don’t appreciate that my circ records go so far back—I’m quite angry about it, actually—and I don’t appreciate that the libraries’ own records schedule is not reflected in its practices. This is not okay. (I’ll have more personal reflections on this later, but I’m trying, however unsuccessfully, to keep this post more nuts-and-bolts-ish.)

As for UWLIB147, it is extraordinarily mendacious and it infuriates me. Yes, as Lisa Hinchliffe recently brought back to my attention while we were discussing the NASIG panel we were both on, there actually are legitimate operational reasons special collections departments keep patron records a long time. People steal and deface unique, rare, and valuable materials and they need to be held to account for it. Using that as a cheap rationalization to retain all circulation records for ten whole entire years, however, is in my view the grossest of absolute garbage.

Trust me, the Alma system knows which materials belong in special collections; that’s one thing catalogers are for and the GLS has many excellent catalogers. “But we can’t limit extended records retention only to special-collections patrons, so we have to keep everything!” is therefore also garbage. Even if Alma doesn’t have this functionality built in—and I have to think it must; practically every library system has weird library- or collection-specific circulation rules somewhere or other, so Ex Libris must have built that—it would be pretty easy to dump special-collections records out of the system to retain them longer.
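
A sketch of how small that dump-and-retain job could be, under loud assumptions: the record layout and the "collection" field are invented for illustration, and nothing here uses or describes a real Alma export or API.

```python
def split_for_retention(records):
    """Partition closed loan records: special-collections ones to keep, the rest to purge."""
    keep, purge = [], []
    for rec in records:
        # "collection" is a hypothetical field standing in for whatever
        # location or collection code the catalog actually assigns.
        (keep if rec["collection"] == "special" else purge).append(rec)
    return keep, purge

# Made-up records for demonstration only.
closed_loans = [
    {"loan_id": 1, "collection": "special"},
    {"loan_id": 2, "collection": "general"},
    {"loan_id": 3, "collection": "general"},
]
keep, purge = split_for_retention(closed_loans)
print(len(keep), "kept for long retention;", len(purge), "eligible for deletion")
```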

Campus data policies

Whoof. These are a nightmare on my campus, not least because the idea that data has to be properly managed and people have to be accountable for that is pretty new. (That’s hardly unique to UW. It’s the base state most places, I think.) Here is your starting place if you have plenty of time free.

What is there in that morass that touches library records? Pretty much nothing. If you’re interested in the learning-management system, however, as good student-privacy advocates certainly should be, there’s plenty you can learn, and it will likely not thrill you.

These kinds of policy warehouses usually have a small taxonomy of data sensitivity that they shoehorn data into. Find that first, because you’ll need it to understand the rest of the available documentation.

And that’s a wrap

That’s plenty to get you going. Happy OSINTing—I hope it’s happier than my students’ turned out to be.

Public-records requesting my own library records

Hi, all. Been a while since I posted anything here; I doubt I need to explain why. I do, however, want to announce the existence of a dataset and give a little background on it. I will have much more to say about it, here and in planned publications, but let’s start with the basics.

The dataset can be found on OSF, and the README.txt file contains the text of my public-records request and details about the files in the dataset. I won’t repeat those here!

Instead, I want to talk about how this came about, because some folks who are not me deserve credit.

I teach a mixed grad/undergrad course in information security and privacy (LIS 510) aimed primarily at non-technical layfolk. The big semesterlong group project for this class asks students to investigate various aspects of the privacy/security situation surrounding their choice of campus-related data. In spring 2021, two groups of graduate students in the LIS master’s program investigated library circulation data.

Both groups independently dug up the records schedule for circulation data, which I hadn’t previously known about and which absolutely blew my mind (and not in a good way) for reasons I will address in a subsequent post. It’s their doing I made this public-records request; if not for them I wouldn’t have known to ask. (I have invited them to work on one or more related publications with me; two have accepted and one more is thinking it over. I am giving them the choice of whether they care to be credited by name.)

I also want to thank the UW-Madison Public Records Office and the General Library System’s IT staff for getting this done promptly and completely, despite a bobble on my part that you can read about in the README.txt. I spent $280 on the request, and it was worth every penny. The records schedule is emphatically not the fault of these offices or their staffs; nor is the (as I believe it to be) immensely and unethically too long retention duration for circulation records. I should add that current Vice-Provost for Libraries Lisa Carter inherited this (as I believe it to be) deeply troubling situation from her predecessors; it is not her doing, either, and working out who actually is responsible will likely take at least one more public-records request.

There’s a lot to explore in this, and I expect it will take me a goodish while to write it all up. (Who has two thumbs and an emergency new-prep summer course that is eating her brain? This gal!) But at least the data is out there to look at now.

One step forward, many steps back

Below is a guest post from Becky Yoose, whom I am proud to count as a colleague, mentor, and friend.


Disclaimer – the following post represents the personal views of the author, and does not necessarily represent the views of her company, LDH Consulting Services.

Last June, the Santa Cruz Civil Grand Jury released a report about their investigation into Santa Cruz Public Library’s use of Gale’s Analytics on Demand (AoD) to analyze patron data. The findings of the report included that SCPL did not properly inform patrons about the use of AoD, nor obtain consent from patrons to use their data in AoD, nor follow industry best practices – the list goes on. Overall the report found that SCPL’s use of AoD put patron privacy at risk. [1]

While the findings and recommendations from the civil grand jury report are not legally binding, the jury required the SCPL Director to respond to the report by September 23rd, 2019. The director’s response to the Santa Cruz Civil Grand Jury Report has been published and, well, it’s a hell of a mixed bag.

The response is divided into two sections: findings and recommendations. The recommendations section gives me some hope. SCPL created a new page about data privacy, posted their vendor assessment questionnaire, and even plans to do another round of privacy policy revisions after additional training. The recommendations to put in a more formal privacy program seem to be accepted, though from the response there is very little information about the “who, what, when, where, and how” of the implementation, including who exactly the new data privacy officer is for SCPL.

While these steps put SCPL on the right track in protecting patron privacy, the findings section tells a completely different story, and a troublesome one at that.

Overall, the director did not directly respond to the findings themselves, with the exception of a couple of findings. The major themes in the responses include:

  1. The state law ultimately permits use of AoD
  2. The state law doesn’t say anything about gaining consent or other matters
  3. Other libraries use AoD

Let’s break these three themes down.

The state law ultimately permits use of AoD

The director of SCPL uses this line to respond to the following findings:

  • The use of Gale Analytics on Demand by Santa Cruz Public Libraries was inconsistent with the library’s long-standing policy on Confidentiality of Library Records (policy 303, adopted February 2006; revised November 2010) and companion document, “Information We Keep About You.” (F1)
  • Santa Cruz Public Libraries did not adequately inform its patrons about the Library’s use of Gale Analytics on Demand or obtain their consent for this use. (F3)
  • Santa Cruz Public Libraries used Gale Analytics on Demand without adequately considering the patron privacy aspects of current California law. (F4)

What difference does it make if the grand jury concluded that use of AoD was ultimately allowed under current California law? That statement does not address the failures of the director’s library to make sure that the privacy policy reflected current practices, that patrons knew and understood how the library was using their data, and that there was a documented legal review of AoD before use. [2] The law might allow it, but that doesn’t necessarily mean that it makes it ok to do it. [3]

The state law doesn’t say anything about gaining consent or other matters

This response was used by the director in:

  • Santa Cruz Public Libraries did not adequately inform its patrons about the Library’s use of Gale Analytics on Demand or obtain their consent for this use. (F3)
  • Santa Cruz Public Libraries used Gale Analytics on Demand without adequately considering the patron privacy aspects of current California law. (F4)

This again is where compliance-only thinking by library administration is not in the best interest of patron privacy. Yes, legal regulations don’t cover everything, and they can’t: technology changes outpace technology regulations. The director is plainly stating that if the law doesn’t say anything, then they are not doing anything wrong.

Let’s rephrase the above—this is a library director who is saying that letting patrons know that the library is using patron data in an analytics program isn’t important, nor is getting patron consent for said use, because the law doesn’t force the library to do so.

The original report does state that there is no regulation around consent and notice; however, the report goes on to say “[a]bsent guidance from the law, California libraries can turn to best practices in the library community to guide them in their interactions with third-party vendors.” (pg. 7). The report then provides an entire section of best practices from the library field (pgs. 8-10). The director completely ignores the existence of these best practices in their response.

Other libraries use AoD

The director used this line in their response to F7 – “The use of Gale Analytics on Demand by Santa Cruz Public Libraries is inconsistent with best practices in the library community regarding patron privacy.”

This, dear readers, is where I introduce you to the cookie jar.

A white hand taking a cookie from a clear glass cookie jar. Source – https://commons.wikimedia.org/wiki/File:Cookie_jar.jpg. Licensed under Creative Commons Attribution-Share Alike 3.0 Unported license – https://creativecommons.org/licenses/by-sa/3.0/deed.en.

There’s an old phrase – “caught with your hand in the cookie jar”. This is usually used when someone is caught doing something wrong. In this instance, the F7 response’s focus on “others do it, so why can’t we?” ignores the industry best practices section in the original report. Yes, there are many hands in that jar, but having your hand in the jar along with the others doesn’t make it right. You can’t point at the others and say that because they had their hands in the jar, you should have yours in there, too.

Pointing at others not only tries to divert from the actual finding (being inconsistent with best practices), but insinuates that you shouldn’t be punished because others haven’t been punished for doing the same thing. The best practices say that you shouldn’t have your hand there in the first place.

Overall, whatever gains we had in the recommendations section are overshadowed by the findings section. A library director not only oversees library operations, but sets priorities and strategic goals for the library. The library director has an overwhelming influence on organizational values, including how the library serves its patrons and its communities. Having a library director respond to privacy concerns brought up by members of the community – members who could also be patrons of the library – with dismissals and compliance-only responses sends a message to the community and the library staff that the library does not prioritize patron privacy beyond doing the bare minimum.

Why would the findings responses overshadow what’s already been done based on the report recommendations? Again, the library director sets the priorities, including where resources are spent and empowering staff to meet patron and community needs. A library director like the one we find in the responses in the findings section is less likely to fully fund a privacy program or give the assigned data privacy officer agency within the library to develop, implement, and assess privacy practices at the library. Any effort will end up being short lived, choked by lack of resources and staff agency.

I worry that many other library directors will adopt this type of response strategy when dealing with their communities’ concerns about privacy practices. This type of messaging is damaging not only to library staff who strive to protect patron privacy at the library and to the trust that patrons have in the library to protect their privacy, but to librarianship on an ethical level. In response to F1, the SCPL library director quotes Article VII of the ALA Library Bill of Rights:

VII. All people, regardless of origin, age, background, or views, possess a right to privacy and confidentiality in their library use. Libraries should advocate for, educate about, and protect people’s privacy, safeguarding all library use data, including personally identifiable information.

Given the responses in the findings section, the library director’s use of this quote could have been replaced by the stock quote you find in many commercial businesses: “We value your privacy.” Based on the responses in the findings section, the first quote is starting to ring as hollow as the second.

[1] A Twitter thread analyzing the report can be found at https://twitter.com/yo_bj/status/1143487843521556480

[2] There is no documentation in the report or response proving that the library performed such a legal review of AoD. It can be argued, then, that the civil grand jury did the legal review for them.

[3] For example, you can be in compliance with legal regulations and still have a privacy breach in the form of an ethics breach. The library profession has a few ethics codes, including codes of ethics from ALA and IFLA. Patrons also have rights in the library as outlined by those organizations as well, and those rights can be at odds with legal regulations.

Elementary, my dear Watson! or, why secure DNS is a good thing

Sherlock Holmes and Dr. John Watson strolled through my neighborhood on a pleasant September evening. What case they were working I have no notion, but I distinctly heard this exchange from my upstairs window:

“Small enough house, this,” said Watson, “and the yard is a horror, but the trim’s lately painted and the gutters new. Minor civil servant?”

“An academic librarian,” Holmes declared confidently, “with a professional interest in institutional repositories and, hm, quite likely other library technology as well. Possibly even one of Michael Gorman’s ‘blog people.’ Dried fruit is a favorite breakfast food, kept in a Whirlpool refrigerator—odd, that; I’d have thought this one a bit of a conservationist.”

“Remarkable, Holmes!” exclaimed Watson. “However did you determine so much without even a sight of the house’s inhabitants?”

“Elementary, my dear Watson,” said Holmes, drawing a Raspberry Pi out of his pocket. “I merely sniffed DNS queries on the house network.”


The Domain Name System (DNS) is the system that matches IP addresses to domain names. For example, as I type this, the domain dsalo.info is mapped to the IPv4 address 192.241.132.143. Any device needing to know this—to browse this website, for instance—sends a “DNS query” containing the domain name into the system, which (through rather roundabout means; here is a good explainer if you care to know more) determines and returns the corresponding IP address.

Here’s the thing. DNS queries and responses presently travel over the Internet in the clear, unencrypted. A teensy little Raspberry Pi computer can indeed sniff any network for them! This allows the Holmeses of this world to learn quite a lot about the Internet behavior of the sniffed network’s denizens, even if every single website they visit is HTTPS-encrypted. It’s much like the NSA’s cell-phone “metadata:” content is unavailable, but there’s plenty to be learnt without it.
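
To see why sniffing works, it may help to look at what a DNS query actually contains. The sketch below hand-builds the bytes of a standard query for dsalo.info, following the classic DNS message layout (a 12-byte header, then the name as length-prefixed labels). The function name and the query ID are arbitrary choices for this example, and nothing is sent over the network.

```python
import struct

def build_dns_query(domain, query_id=0x1234):
    """Build the bytes of a standard DNS query for the A record of `domain`."""
    # Header: ID, flags (only "recursion desired" set), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in domain.split(".")
    ) + b"\x00"
    # QTYPE 1 = A (IPv4 address), QCLASS 1 = IN (Internet).
    return header + qname + struct.pack("!HH", 1, 1)

packet = build_dns_query("dsalo.info")
print(packet)
```

Printing the packet shows “dsalo” and “info” sitting in the bytes as plain readable text, which is exactly what anyone on the same network path gets to read.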

The Not-so-Secret Royalty of the Internet is presently trying to fix this mess. Last I looked, there were three different proposals in the hopper for securing DNS queries from random detectives and other malfeasors. Though none of them is receiving unanimous acclaim, DNS-over-HTTPS appears to be something of a frontrunner due to ease of implementation.

Whichever proposal wins out, I would hope libraries would have the sense to implement it as soon as practical.


Below is the list of domain names on which Holmes based his deductions. I got the list by turning Wireshark loose to packet-capture my own home network one morning when nobody else was home, then saving the sniffed traffic as a “packet-capture” (colloquially “pcap”) file and running Wireshark’s Statistics --> Resolved Addresses command on it. By all means see if you can reproduce Holmes’s deductions! As I tell my students, I’m a hypocrite: I deplore the Internet of Things, yet I have an internetted Thing in my very own home.

# Resolved addresses found in /Users/Dorothea/Dropbox/Courses/510/ExamplePCAP.pcap
# Hosts
# 72 entries.
170.224.177.96 whirlpool.com
107.20.197.119 rpxnow.com
52.48.107.73 www.atmire.com
104.154.236.248 248.236.154.104.bc.googleusercontent.com
74.126.144.95 74-126-144-95.wansec.net
74.126.144.96 74-126-144-96.wansec.net
107.20.133.174 rpxnow.com
172.217.9.33 photos-ugc.l.googleusercontent.com
192.0.77.32 s0.wp.com
72.21.91.70 cs45.wac.edgecastcdn.net
104.19.196.151 cdnjs.cloudflare.com
52.55.165.109 grpc-web-proxy01-production-849742786.us-east-1.elb.amazonaws.c
23.45.144.149 e2486.g.akamaiedge.net
216.80.72.149 vbw1.ala.org
107.20.177.204 rpxnow.com
23.192.166.97 e3194.x.akamaiedge.net
104.19.199.151 cdnjs.cloudflare.com
18.210.177.140 ec2-18-210-177-140.compute-1.amazonaws.com
34.232.198.212 grpc-web-proxy01-production-849742786.us-east-1.elb.amazonaws.c
34.234.240.76 ec2-34-234-240-76.compute-1.amazonaws.com
151.101.184.134 prod.disqus.map.fastlylb.net
176.9.5.110 static.110.5.9.176.clients.your-server.de
52.84.11.178 d3hmp0045zy3cs.cloudfront.net
13.33.165.155 server-13-33-165-155.ord50.r.cloudfront.net
72.21.81.200 cs22.wpc.v0cdn.net
104.19.195.151 cdnjs.cloudflare.com
151.101.184.188 dualstack.imgix.map.fastly.net
198.58.116.203 meredith.wolfwater.com
192.0.77.2 i0.wp.com
52.222.217.179 server-52-222-217-179.msp50.r.cloudfront.net
128.148.254.67 plibwwwcit.services.brown.edu
10.0.1.1 Galactus.local
172.217.3.225 photos-ugc.l.googleusercontent.com
52.222.209.170 server-52-222-209-170.msp50.r.cloudfront.net
128.193.164.120 lib-c4l.library.oregonstate.edu
192.241.132.143 devo.reclaimhosting.com
23.21.245.99 sunmaid.com
104.19.198.151 cdnjs.cloudflare.com
192.0.73.2 secure.gravatar.com
52.84.11.137 d3hmp0045zy3cs.cloudfront.net
34.229.227.71 ec2-34-229-227-71.compute-1.amazonaws.com
34.224.175.4 webapi.camera.home.nest.com
52.54.32.125 webapi.camera.home.nest.com
94.130.66.93 mokum.place
224.0.0.2 all-routers.mcast.net
162.144.29.143 162-144-29-143.unifiedlayer.com
207.38.94.44 web623.webfaction.com
108.177.122.132 photos-ugc.l.googleusercontent.com
52.201.33.128 webapi.camera.home.nest.com
54.243.117.215 prodhost-1739552751.us-east-1.elb.amazonaws.com
52.84.11.145 d3hmp0045zy3cs.cloudfront.net
54.243.186.135 prodhost-1739552751.us-east-1.elb.amazonaws.com
52.222.209.29 server-52-222-209-29.msp50.r.cloudfront.net
52.44.49.101 grpc-web-proxy01-production-849742786.us-east-1.elb.amazonaws.c
93.184.216.182 s9.gp1.wac.gammacdn.net
52.86.243.133 ec2-52-86-243-133.compute-1.amazonaws.com
104.19.197.151 cdnjs.cloudflare.com
52.84.11.100 d3hmp0045zy3cs.cloudfront.net
2606:2800:220:807:12d7:2210:17b1:7cf s9.gp1.wac.gammacdn.net
2400:cb00:2048:1::6813:c597 cdnjs.cloudflare.com
2607:f8b0:4009:80a::2001 photos-ugc.l.googleusercontent.com
2607:f8b0:4009:815::2003 ord38s08-in-x03.1e100.net
2607:f8b0:4009:812::2001 photos-ugc.l.googleusercontent.com
2607:f8b0:4002:c06::5f yv-in-x5f.1e100.net
fe80::de9b:9cff:feef:be21 Galactus.local
2607:f8b0:4002:c09::be yb-in-xbe.1e100.net
2400:cb00:2048:1::6813:c697 cdnjs.cloudflare.com
2400:cb00:2048:1::6813:c397 cdnjs.cloudflare.com
2400:cb00:2048:1::6813:c797 cdnjs.cloudflare.com
2a04:4e42:2c::188 dualstack.imgix.map.fastly.net
2607:f8b0:4009:803::2016 ord37s08-in-x16.1e100.net
2400:cb00:2048:1::6813:c497 cdnjs.cloudflare.com
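
If you would rather let a script do the first pass of Holmes’s work, here is a minimal sketch that parses a listing in the format above and groups addresses under each hostname’s last two labels. The sample lines are copied from the list; the grouping rule is a crude heuristic chosen for this example, not anything Wireshark does.

```python
from collections import defaultdict

def parse_resolved(text):
    """Parse 'address hostname' lines, skipping blanks and '#' comments."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        address, hostname = line.split(None, 1)
        pairs.append((address, hostname))
    return pairs

def by_parent_domain(pairs):
    """Group addresses under the last two labels of each hostname (a rough heuristic)."""
    groups = defaultdict(set)
    for address, hostname in pairs:
        parent = ".".join(hostname.split(".")[-2:])
        groups[parent].add(address)
    return groups

# A handful of lines taken from the listing above.
sample = """\
# Hosts
170.224.177.96 whirlpool.com
216.80.72.149 vbw1.ala.org
23.21.245.99 sunmaid.com
34.224.175.4 webapi.camera.home.nest.com
52.54.32.125 webapi.camera.home.nest.com
128.193.164.120 lib-c4l.library.oregonstate.edu
"""

groups = by_parent_domain(parse_resolved(sample))
for parent in sorted(groups):
    print(parent, len(groups[parent]))
```

Content-delivery and cloud hostnames will lump together under this rule, so treat the output as a list of leads to read by eye rather than as conclusions.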

Get out! GET OUT!

I liked John Legend’s portrayal of Jesus in NBC’s Jesus Christ Superstar, but there was one moment I really needed a heavy-metal scream from him and didn’t get it. (As a sometime chorister, I understand! Heavy-metal screams are tough on the throat.)

In this Big-Data-brownnosing Chronicle puff piece, the movers behind Unizin and its grossly invasive Big Data platform make plain what lots of us (self included) suspected: this is an Orwellian-telescreen Bentham-panopticon play.

But as important or more, says Hilton, is what can be learned from Unizin’s capacity for data collection and analysis. Members have access to the online-learning platform Canvas and its associated data — not only from each member institution but also collective, anonymized data. That will increasingly enable discovery of broad trends from the overall membership, as well as micro-information for every class. Combine that with data on students’ demographics and background, Unizin says, and you have a powerful tool to make predictions, spot problems, and improve pedagogy.

“Anonymized” data isn’t, not at the level of detail Unizin is planning. “Micro-information for every class” means labor surveillance. The default is surveillance-on, of course, and it may well be the only option. As for Brad Wheeler? I’ve seen his big ideas before. One thing I personally, in-person, my own self, noticed is that he steamrolls everyone he can—especially women. (Cis male higher-ed educators in Unizin shops, you need to speak up, and the time you need to speak up is now.)

I am heavy-metal screaming right now. My classroom should be a house of learning! But you have forced in these data thieves!

Get out. GET OUT!

Patrons schooling librarians on privacy

My big-data ethics course is underway; I’m quite enjoying the changes it’s making in how I evaluate what I read. (“Deontologist,” I muttered to myself while bookmarking another ethics-in-AI piece the other day.) It also explains why Tattle Tape’s been a bit quiet lately—with that, Data Doubles work, RADD work, and revising and Canvas-ing my other summer course, I have no time to breathe.

Still, some things I can’t miss posting about and still be me, so. Yesterday a group of citizens of Santa Cruz, California put out an amazing document detailing how the Santa Cruz Public Library bought into a surveillance-as-a-service deal from Gale, and how that deal stomped all over patron privacy.

Let me say this again, a little louder: a group of public-library patrons absolutely schooled their library on privacy. What has happened to my profession. What.

Some librarians there can be proud of themselves: the ones who, in the words of the report, “voiced concerns about patron privacy.” Thank you and well done, SCPL librarians who spoke up. Those SCPL librarians and/or administrators who overrode those concerns should do some serious soul-searching. Y’all messed entirely up, and you are being called to account for it; the citizen group is a local “Grand Jury” and they have the authority to require a response from SCPL top brass.

To that top brass I say: admit your fault, apologize sincerely, dump Gale right back into the filthy surveillance-capitalism abyss whence it came, and copy out the ALA Library Bill of Rights one hundred times longhand in full public view. I don’t want to hear any empty platitudes or who-could-have-knowns out of you—and more importantly, that grand jury doesn’t want to hear that either.

In my time pushing privacy, I’ve seen some librarians say that we can be trusted with patron data because we care about privacy and ethics—we’re the good guys. To that I say what I said to my big-data ethics students: no one can be intrinsically good, or indeed be good at all. We can only do good—or not. We librarians only care about privacy and ethics insofar as we put that caring into action.

SCPL didn’t do that. I’m so glad SCPL’s patrons called out the problem so cogently and effectively—and I am bitterly sorry and ashamed that they had to.

Triumph, RomAn-21-style

It is pretty nice to see librarians step up. That’s what just happened with our good buddy RA21. The chariot awaits, everybody. Climb in!

But I’ll be the whisperer behind the chariot: This probably isn’t over. I mean, RA21 will certainly go back to the drawing board, and a fittingly-embarrassed NISO will be at least a tiny bit more careful not to be so obviously vendor-captured next time…

… but the STM Association is a hard nut to crack. I’ve tangled with them before, and watched others tangle with them as well. They’re persistent, they don’t subscribe to library ethics, and they use every dirty trick in the book and a few outside it.

I’m less worried than I was, since my chief fear was that the STM Association with NISO as convenient clueless patsy would sneak RA21 past libraries unopposed. I’m now confident that won’t happen. What will happen is hard to guess—the standards space is labyrinthine and the STM Association knows it pretty well, so finding or even making another, quieter patsy is hardly off the table. We’ll just have to wait and see.

Physical-equivalent privacy

One of the confounders thrown at privacy advocates inside and outside libraries is “privacy is too nebulous a concept to operationalize!” I’m sympathetic to a point, because the philosophers and the ethicists and the lawyers and the coders have made rather a tangled mess of things, but frankly in my head I call this confounder an “ethics smell” (after programmers’ “code smell,” which is a hint somewhere in otherwise-working code that something in the program’s design or construction is very, very wrong). It’s a cheap excuse anyone can use to pretend privacy doesn’t actually matter and shouldn’t be a consideration. We who value privacy are still stuck answering this smelly objection, though, so how do we usefully do that?

In some library conference talks I’ve done, I’ve groped toward a formulation I’m now calling “physical-equivalent privacy.” That is, if we wouldn’t track a print book, or a person using the physical library, in a particular way, the digital analogue to that tracking behavior is also not okay. Put more formally, “the library patron using library-provided electronic information should enjoy privacy protection equal to that of the same patron using the same information via a library-provided physical information carrier.” This is not a perfect analogy, let me just state that up-front—physical surveillance is also ramping up in all too many contexts, even in libraries—but it productively tickles most folks’ sense of what’s creepy, and I think it also activates a lot of tacit operational-privacy knowledge in librarianship.

Let’s walk through an example: usage counting for spaces. Physical first.

  • Gate counts, or a count clicker? Not terribly creepy.
  • Counts by a person walking through the library? A little creepy, but the creepy factor can be managed through a clear explanation (“we’re only counting, not recording anything else about anybody”).
  • Counts that record possibly-visually-assessable demographic information (let’s say gender, race/ethnicity, and socioeconomic status) about patrons? Decidedly creepy, especially for patrons of already-oversurveilled populations, as well as quite prone to inaccuracy—one wouldn’t want to predicate services on such an assessment if one has the common sense of an oyster. Nor would one care to explain this practice to patrons, because not a few patrons (self included!) would flee the library for good.
  • Counts by a person who asks each patron about their demographic information? Highly, highly creepy—we’ve definitely arrived at “scary” now, if we hadn’t already.
  • Counts by a person who asks each person their name or library-card barcode number? So scary that the library would likely empty out.
  • Combining either of the last two with recording the library material that the patron is reading, watching, or listening to? Absolutely beyond the pale; I would expect huge protests from patrons, working professionals, and professional orgs.
  • Combining demographic information, name or other identifier, and materials choice? Ugh, just forget it—that’s utterly beyond scary into full Orwell.

You may have noticed that I left video surveillance out of the above list. You’re right; I did, and consciously so. Some people feel safer in the presence of video surveillance. I think they’re very often wrong to! But that still means I can’t usefully invoke video surveillance to tickle folks’ creepy meter, because it won’t always work as I’d intend it to. Audio surveillance might be an effective creepy-meter tickler, though, and it is salient in the context of voice-activated mobile phones and other devices.

Your sense of creepitude may differ from mine, but if you’re a librarian, I’d guess we’re fairly close together. (Caveats: some librarians don’t understand or believe that children also need and deserve information privacy, and some librarians have been utterly seduced by surveillance capitalism and its cousins “analytics,” “library value,” and “personalization.”) So let’s take that sense of creepy and apply it to digital analogues.

  • Gate counts? Analogous to unadorned usage counts—and I do mean “unadorned,” no IP addresses, no login information, minimal to no geolocation, exact time fuzzed at least to hour. Not particularly creepy in and of itself, but I might argue that the persistence of this information, and lack of disclosure about where else it will go and how it will be used, starts to get just a little creepy.
  • Counts by a person walking through the library? Hm. Let’s call it a person looking at the above unadorned usage counts. Not creepy—but also not visible to patrons in a way that begs to be explained to them. One important factor in creepiness is “doing something to/about me without my knowledge, much less agreement.” This fuels the often-researched observation that people get less comfortable with digital surveillance the more they find out about it.
  • Counts that record demographic information? In academic libraries, this can take the form of correlating library and library-materials use with information from the institutional student-records office. A lot of learning analytics projects have done this, not thinking anything of it. It’s certainly more accurate than the impressionistic physical version, but does that really cut the creepy any? As for public libraries, this is what customer-relationship-management tools like OCLC Wise, currently being heavily flogged at conferences, run on. It’s hella creepy in my book, and it’s noticeably less privacy than a patron using the physical library and physical library materials has. Not okay. Not okay!
  • Counts by a person who asks each patron about their demographic information? Same as above, really, except again, the “ask” part is missing. Learning analytics researchers slice and dice the data by demographic regularly, even when the numbers end up so small as to be wholly ungeneralizable—but highly reidentifiable. It’s creepy. It’s wrong. When the end goal is “personalization”—that is, a group of few or even one!—the creep factor multiplies further.
  • Counts by a person who asks each person their name or library-card barcode number? This one’s easy: many electronic-materials vendors in both public and academic libraries either collect this information directly or make the library collect it (e.g. by proxy server). For library-computer and network use, device identifiers (like mobile phone MAC addresses or IDFAs/ADIDs) come into play. And again, learning-analytics research has extended this identify-the-patron paradigm into information-literacy instruction, computer and network use, interlibrary loan, and other areas where it absolutely doesn’t belong. Creepy as all get-out.
  • Combining either of the last two with recording the library material that the patron is reading, watching, or listening to? Learning-analytics research, again. Have I gotten across how creepy that stuff is yet? Have I? I mean, go read it if you don’t believe me.
  • Combining demographic information, name or other identifier, and materials choice? Even learning-analytics researchers typically fuzz the materials choice. Not always, though, and some of them seem to wish they didn’t have to.
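
For the gate-count end of that list, here is a minimal sketch of what “unadorned” could look like in practice. The event fields (`resource`, `timestamp`, `ip`, `login`) are hypothetical stand-ins for whatever a web server or proxy happens to log, not any vendor’s schema. The point is only that the counting step reads the resource and the hour and nothing else.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events; field names and values are illustrative assumptions.
RAW_EVENTS = [
    {"resource": "ebook-123", "timestamp": "2019-05-01T14:07:33",
     "ip": "203.0.113.7", "login": "patron42"},
    {"resource": "ebook-123", "timestamp": "2019-05-01T14:51:02",
     "ip": "203.0.113.9", "login": "patron17"},
    {"resource": "journal-9", "timestamp": "2019-05-01T15:02:10",
     "ip": "203.0.113.7", "login": "patron42"},
]


def unadorned_key(event):
    """Keep only the resource and the hour; never read ip or login."""
    when = datetime.fromisoformat(event["timestamp"])
    hour = when.replace(minute=0, second=0, microsecond=0)
    return (event["resource"], hour.isoformat())


def count_usage(events):
    """Tally uses per (resource, hour); identifiers never reach the tally."""
    counts = Counter()
    for event in events:
        counts[unadorned_key(event)] += 1
    return counts


usage_counts = count_usage(RAW_EVENTS)
```

This only covers what gets kept. It says nothing about how long the counts persist or where they go next, and very small hourly counts for a rarely used item may still deserve coarser fuzzing.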

See how this works, kinda? I hope so. Another way to work it is starting from a digital-surveillance situation and working out its physical analogue. My favorite example, which I’ve used in my classroom, is the Adobe Digital Editions fiasco from 2014. To get this level of information about use of a checked-out print book, you’d have to follow the patron home and watch them as they read it! Not even slightly okay! So why is it okay when it’s an ebook?

I’m not pretending this or any analogy can be a silver bullet. I do think this could be a valuable rhetorical technique to sway folks on the fence about the latest library-privacy invasion, or to face down the Library Value Agenda at conferences or library-internally. In that latter case, I suggest combining it with a “you-first” strategy, to bring in the problems of power differentials and nonexistent or forced consent: “How about we pilot this by tracking library staff in this fashion? No? Why is it okay to do this to students, then?”

Good luck. I’d be interested to know about folks trying this, and how it turns out for them.

Thinking through RA21

“Recommended Practices for Improved Access to Institutionally-Provided Information Resources,” colloquially known as “RA21,” is open for public comment until the 17th of May. I definitely plan to comment, but this post is not that comment—this post is the plonking, pedantic process of figuring out and refining what I actually think so I can make a comment that is at least slightly more cogent than “ARGH, NO, CAN WE PLEASE NOT” or “Most mendaciously misleading specification title ever, y/n?”

You may have gathered from the above that I am not a fan of this spec. Quite right. I’m not. You know, cards on the table and all that.

One reason I am not a fan is the authorship list (see page 2 of the spec PDF), which is a bunch of big-pig publishers with a sprinkling of IT people and… (as best I can tell) exactly zero librarians, LIS faculty, or library or technology ethicists. (Peter Brantley works in a library but is not a librarian. Usually I wouldn’t care, but in this specific situation, I think it matters. I also think it matters that Brantley is a longtime participant in hypothes.is, a project that has signally failed to get to grips with the staringly obvious truth that a tool that forces comments onto every website ever is a living invitation to harassment and abuse. Not, in my book, trustworthy on privacy or safety—and I can’t help wondering whether this is precisely why he landed on the authorship group.) How am I not supposed to suspect that the big pigs are pulling a fast one on privacy here? Especially given that the big pigs are presently a miasma of surveillance-capitalism FOMO?

(Seriously, this whole thing is giving me flashbacks to back in the day when the NIH held hearings on a possible open-access policy and only invited the big pigs. That was ridiculously regulatory-captured—though fortunately the NIH got over that—and so is this. Does NISO not notice when it’s being used as a patsy and a figleaf?)

So. One obvious threat model here: the big pigs are trying to use RA21 to accumulate saleable (or otherwise exploitable, perhaps via Facebook-style microtargeted advertising) data about individual library patrons’ reading, data they often can’t get now because proxy servers and IP ranges are in the way. Let’s take as read that aggregate use data is either not saleable/exploitable at all, or not nearly as lucrative to sell/exploit as use data where individuals are identified or reidentifiable.

Another threat model that springs to mind: individualized and/or reidentifiable surveillance of patron information use, likely in the name of “learning analytics” or “assessment.” The big pigs could want to do this. Institutions could want to do this. Unholy alliances between big pigs and institutions (e.g. the sordid roiling cauldron of Arizona State plus Cengage, or the blithely horrifying Unizin Consortium) could want to do this.

Quick reminder, academic librarians: ACRL says quite clearly and distinctly that “The privacy of library users is and must be inviolable. Policies should be in place that maintain confidentiality of library borrowing records and of other information relating to personal use of library information and services.” (Another of my long-brewing rants has to do with “confidentiality” as a word librarians doing Seriously Shady Stuff hide behind. Rant for another day, though.)

Um. Possibly I should back up and explain what RA21 even is first? Yes. I should do that.

RA21 and single sign-on (SSO)

So, I explained before how libraries deploy IP range recognition and proxy servers to get Dr. Scientist through publisher paywalls. RA21 wants to replace this constellation of systems (which, to be fair, is decidedly kludgy and maintenance-intensive) with SSO, which is common (though not universal) higher-ed IT infrastructure. If you work and/or study in higher ed, chances are good that whenever you use your institution’s login credentials, there’s an SSO server authenticating you and at least partially authorizing you to do things with, on, or to institutional systems.

The central ideas fueling SSO aren’t too hard to get one’s head around. Organizations that run an SSO server to authenticate their students and staff are “Identity Providers” or “IdPs.” An IdP knows and can share various facts about the people represented in it; these facts are “attributes,” or taken as a whole “metadata.” Then there are services that need to ask IdPs about people in the organization, often though not solely to authenticate them and authorize service use; these are “Service Providers” or “SPs.” SPs may be internal or external to the organization—that is, a campus system as well as a big-pig publisher could rely on the campus’s SSO server for authentication and authorization. For probably-obvious reasons, though, internal SPs tend to have broader access to more attributes than external SPs. Clear so far? Cool.
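
If the vocabulary is easier to hold onto as a data structure, here is a toy model of it. Everything in it is an assumption made for illustration: the class names, the attribute names, and especially the two release lists, which are not taken from any real IdP’s configuration or from the RA21 spec.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceProvider:
    name: str
    internal: bool  # True for a campus system, False for e.g. a publisher


@dataclass
class IdentityProvider:
    # person id -> {attribute name -> value}
    directory: dict = field(default_factory=dict)
    # Attribute names each class of SP may see (illustrative assumption).
    internal_release: frozenset = frozenset({"name", "email", "affiliation"})
    external_release: frozenset = frozenset({"affiliation"})

    def release(self, person_id, sp):
        """Return only the attributes this kind of SP is allowed to receive."""
        allowed = self.internal_release if sp.internal else self.external_release
        attributes = self.directory[person_id]
        return {k: v for k, v in attributes.items() if k in allowed}


idp = IdentityProvider(directory={
    "p1": {"name": "Dr. Scientist", "email": "sci@example.edu",
           "affiliation": "faculty"},
})
campus_system = ServiceProvider("campus-lms", internal=True)
publisher = ServiceProvider("big-publisher", internal=False)

internal_view = idp.release("p1", campus_system)
external_view = idp.release("p1", publisher)
```

In this toy setup the external SP learns only that somebody with a faculty affiliation showed up. Who gets to edit the release lists is exactly the governance question.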

What attributes do SSO servers typically store and share on people? This varies by IdP, and can be shockingly hard to find good information about. (One of the reasons this post has taken so long is precisely the time involved in trying to back-of-the-envelope calculate a baseline for attributes across IdPs. I’m posting now partly because of NISO’s RA21 feedback deadline and partly because I’ve given up on using OSINT to figure out how IdPs handle attributes. It’ll take somebody like Educause running a survey.) Dear campus-IT units everywhere: be transparent about this, would you please?

Brown University has the clearest, completest attribute rundown I’ve yet seen. (Dear campus-IT units everywhere: imitate Brown, please.) A few things to notice here:

  • Several attributes are actual personal identifiers. Others constitute personally-identifiable information. Still more are indirect identifiers—combined, they could pick someone out of the pack even without an actual personal identifier. (Has anyone run a study on how k-anonymous the entire list of someone’s eduPersonEntitlements or isMemberOfs is? I’d love to know.)
  • External SPs—such as our big-pig publishers—can request that Brown hand them at least one personal identifier (eppnPersonalName).
  • The exact circumstances or process under which Brown would grant an external SP access to this identifier are not stated. (I looked but didn’t find anything. I may admittedly have missed something, or the documentation for this may be on a Brown intranet.) If there’s a generic recommended process and data-governance scheme for this, I haven’t found that either.
  • The supposedly-anonymized personal identifier (persistent-id) is unchanging, which allows an external SP to build up enough of a behavior-trail or web-tracker dossier over time to reidentify someone. (John Mark Ockerbloom has already pointed this out to NISO. Yay! That means I don’t have to!)
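
On the k-anonymity question in the first bullet: the measurement itself is not hard, given the data. A rough sketch follows, using invented people and invented group names (real isMemberOf and eduPersonEntitlement values vary by campus). It counts how many people share each exact set of memberships; the smallest count is k, and anyone in a group of one is picked out by their membership list alone.

```python
from collections import Counter

# Made-up people and group memberships, for illustration only.
PEOPLE = {
    "a": {"students", "lis-510"},
    "b": {"students", "lis-510"},
    "c": {"students", "lis-510", "chess-club"},
    "d": {"staff"},
    "e": {"staff"},
}


def group_sizes(memberships):
    """Count how many people share each exact set of memberships."""
    return Counter(frozenset(groups) for groups in memberships.values())


def k_anonymity(memberships):
    """The smallest group size: the dataset is k-anonymous for this k."""
    return min(group_sizes(memberships).values())


def unique_people(memberships):
    """People whose membership list alone singles them out (group of one)."""
    sizes = group_sizes(memberships)
    return sorted(
        person for person, groups in memberships.items()
        if sizes[frozenset(groups)] == 1
    )


k = k_anonymity(PEOPLE)
singled_out = unique_people(PEOPLE)
```

Here one extra club membership is enough to drop k to 1. Whether real attribute lists behave this way is an empirical question this sketch cannot answer.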

So could my RA21 threat models work?

Going by the RA21 spec, how could big pigs pile up a saleable/exploitable dataset of individuals and their information use? Well, trackers on their websites, of course, but that’s out of scope for RA21 so let’s leave it aside. Could they do it if they con institutions into helping? Perhaps in the name of the Library Value Agenda or similar?

On page 8, the fourth desideratum for RA21 reads: “The system will achieve end-to-end traceability, providing a robust, widely adopted mechanism for detecting fraud that occurs at institutions, vendor systems, and publishing platforms.” Unless the sole imagined fraud prevention method here is cutting off Dr. Scientist’s unpermissioned literature text-mining project without any further attempt to locate or punish her—which I would be okay with, actually—that suggests that on some level and for some time period, patrons’ information use must be traceable back to them by someone. On what level, for what time period, and by whom? And does RA21 allow the information necessary for fraud prevention to be (mis)used for other reasons, say, Big Pigs wanting to sell or exploit it? Who knows? RA21 doesn’t say. It damned well should.

The real meat is in section 2.3 starting on page 14. Line 1 column 2 of the table (minus the headers, you persnickety people) reads “Unless the SP has a specific, contractual agreement with an IdP, the IdP should only send anonymous and pseudonymous identifiers to the SP.” There. There. There is the incredibly giant loophole the big pigs will just waltz right on through. All SPs have to do to snoop individual patrons’ information use is convince IdPs to let them.

We trust IdPs implicitly… why exactly? Dear NISO: It’s called “insider threat.” Look it up. Then add it to your hilariously incomplete security/privacy threat model.

If I had better vidding skills than I do, I’d voiceover a scene from Disney’s Little Mermaid:

Big Pig: Come in. Come in, my librarian. We mustn’t lurk in doorways—it’s rude. One might question your upbringing… Now, then. You’re here because you have a thing for assessment. This, er, learning analytics doohickey. Not that I blame you—it is quite a temptation, isn’t it? Well, awed vocationist, the solution to your problem is simple. The only way to get what you want—is to send me an identifier attribute and ask me to collect the identified data for you.

Librarian: You can do that?!

Big Pig: My dear, sweet librarian. That’s what I do—it’s what I live for. To help underappreciated librarians like yourself. Poor souls with no one else to turn to.

I’ll spare you the actual filk of the song. (If you want to do it, though, go right ahead!)

It gets worse. Proxy servers and IP ranges are generally handled by the library, often with an assist from campus IT. SSO servers, however, are the domain of campus IT, with zero input whatever from the library. Campus IT does not subscribe to library ethics. Campus IT has different pressures on it than the library—most provosts, chancellors, and deans know better than to openly force the library to compromise privacy; the same is not true of campus IT (which often enough doesn’t give a crap about privacy anyway unless forced to). So the Big Pigs can probably have whatever identifiable data they want just by hinting to campus IT or campus administrators that it’d be useful assessment data to have, and wouldn’t you darling people just love to request it from us? All you need to do is send us the identifier attribute that makes it possible!

Not, frankly, that many academic libraries would resist temptation all that much better.

RA21 must put a stop to this scenario if it is to conform to library privacy norms. IdPs must only send identifier attributes in cases of suspected fraud, and when the case is closed, they must stop sending those attributes. SPs must only request identifier attributes in cases of suspected fraud, and when the case is closed, they must cease the request and discard all identified data. Furthermore, IdPs must never grant SPs access to attributes without the library’s explicit approval, and the library may only ever grant such approval for identifier attributes (or indirect identifiers in excess of what is absolutely necessary) in cases of suspected fraud.
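
To show the identifier rule is concrete enough to implement, here is a sketch of it as a release check. It models only the identifier-attribute part of the rule above (an open, library-approved fraud case is the sole thing that unlocks identifiers), not the broader library approval for every attribute. The `FraudCase` record, the attribute names, and the `releasable` function are all hypothetical; nothing like them appears in the RA21 spec.

```python
from dataclasses import dataclass

# Illustrative list of identifier attributes; not taken from the spec.
IDENTIFIER_ATTRIBUTES = {"name", "email", "persistent_id"}


@dataclass
class FraudCase:
    sp_name: str
    library_approved: bool
    closed: bool = False


def releasable(requested, sp_name, cases):
    """Return the subset of requested attributes the IdP may send to this SP.

    Identifier attributes pass only while a library-approved fraud case
    involving this SP is still open.
    """
    case_is_open = any(
        case.sp_name == sp_name and case.library_approved and not case.closed
        for case in cases
    )
    if case_is_open:
        return set(requested)
    return set(requested) - IDENTIFIER_ATTRIBUTES


cases = [FraudCase(sp_name="big-publisher", library_approved=True)]
request = {"affiliation", "email"}

during_case = releasable(request, "big-publisher", cases)
cases[0].closed = True
after_case = releasable(request, "big-publisher", cases)
```

Closing the case is what stops the identifier flowing. The matching obligation on the SP side, discarding identified data it already holds, cannot be enforced from the IdP and would have to live in the spec and in contracts.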

Or, to boil it down to its essentials, try this: Neither IdPs nor SPs may sell, share, or use identifiable data, nor may they attempt reidentification of deidentified data, for any reason except as explicitly stated in (what had better be a new and very greatly improved) RA21. I mean, I suspect this to be a pipe dream; NISO hasn’t the guts to tell assessment and learning analytics to take a long walk off a short pier, and the Big Pigs clearly see assessment and learning analytics as an easy lockpick for the identified-data store. But I can dream, can’t I?

Line 2, column 2 expands on this theme: “In addition to the attributes provided by academic libraries, an additional attribute may be sent to the SP to support specific granular usage analysis or charge back requirements that the IdPs may have.” What does “granular” mean exactly? Who in the IdP may initiate such a request, for what reasons? What else can the SP do with this attribute? Who knows? RA21 doesn’t say. It damned well should.

I do approve of the last line on page 15: “In the case that the IdP sends more attributes than the minimal request from the SP, the SP must not collect or store that data under any circumstance.” Keep that. It’s not enough without a definition of “minimal request,” though—a Big Pig could perfectly well ask for every available attribute and call that minimal.

The rest of the spec is devoted to questions of UX. There isn’t a single additional word about privacy or security that I saw. So there we are. Threat model validated!

NISO, spec authors, the RA21 spec should not be approved, much less implemented, without vastly better data-governance protection for patrons against insider threat from IdPs and SPs than it currently contains. Fix it, please.

Teaching adversarial thinking

In case you missed it: A couple months ago a law prof brought on the wrath of academic Twitter by suggesting that students spend a week eavesdropping on the conversations of others to listen for people betraying their own security and privacy, a thing that people quite commonly do. Some of academic Twitter (self included) was initially entranced, until other parts of academic Twitter asked whether casual snoops (or even not-casual snoops) were really an okay thing to turn our students into, especially when many of our students are still so unaware of the workings of privilege that snooping can take on exceptionally sinister overtones when applied to certain populations.

So the initially-entranced folks, self included, backed off our initial enthusiasm, and the furor seems to have mostly died down. I, however, am still stuck with a pedagogical problem: as an instructor in introductory information security, I actually do have to teach people to snoop on, and even attack the privacy and security of, other people and systems. I know that sounds horrifying. I know it does! And it definitely gets into some pretty dark-gray gray areas. But stick with me just a bit longer while I explain.

Over a longish period of information-security work, it’s become clear that the only way to have any confidence at all that a system (in the large sense, so not just “a technological system” but “a technosocial system, emphatically including the people involved or even enmeshed in it”) is secure or private (not, of course, the same thing) is to test it by attacking it:

  • To test whether deidentification suffices to anonymize a dataset (spoiler: it rarely if ever does), researchers try to reidentify one or more people in it, often using additional available data to test reidentification via dataset overlap. See, for example, the Narayanan and Shmatikov paper that doomed the Netflix recommender-system contest.
  • To test the security of a given piece of software, you ultimately try to break it. Yes, there are tools (e.g. “Google dorks,” “vulnerability scanners,” “fuzzers,” even Shodan) to locate obvious or common problems, but they’re not enough. A creative, lateral-thinking human being is much better at finding exploitable holes in code than a computer.
  • To prioritize and test for holes in systems (again, “system” writ large), you first think like an adversary—what are the crown jewels in this system, and how would someone who wants them attack the system? This is called “threat modeling,” and thinking-like-an-adversary is a crucial part of it; without that, you end up with what Bruce Schneier calls “movie-plot threats” while ignoring the gaping system problems right under your nose (as, for example, Equifax certainly did). A crucial insight in threat modeling, captured concisely in this xkcd cartoon, is that your enemies always attack with the simplest method likely to work.
  • And once you have your threat model, you test how well your system resists it by, well, attacking your system in the ways you have identified it to be potentially vulnerable! This often happens in the form of “penetration testing,” which can be done on physical systems, social systems, technological systems (such as networks or software), or any combination of the three. My favorite example of a pentest that goes after all three types of system is this absolutely astounding Twitter thread, which I use in my intro course, and after which I named the class’s messing-around server “Jek.”
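
The dataset-overlap test in the first bullet can be sketched on toy data. Everything below is fabricated for the purpose: the names, the fields, and the choice of quasi-identifiers are assumptions, and this is a bare-bones exact-match join, not the statistical method from the Narayanan and Shmatikov paper. A “deidentified” row counts as reidentified when its quasi-identifiers match exactly one row in a second, named dataset and no other deidentified row shares them.

```python
from collections import defaultdict

# Fabricated "deidentified" rows: names removed, quasi-identifiers left in.
DEIDENTIFIED = [
    {"major": "history", "cohort_year": 2011, "home_state": "WI", "checkouts": 42},
    {"major": "biology", "cohort_year": 2012, "home_state": "MN", "checkouts": 17},
]

# Fabricated named rows, standing in for separately available public data.
PUBLIC = [
    {"name": "Alex Example", "major": "history", "cohort_year": 2011, "home_state": "WI"},
    {"name": "Sam Sample", "major": "biology", "cohort_year": 2012, "home_state": "MN"},
    {"name": "Pat Placeholder", "major": "biology", "cohort_year": 2012, "home_state": "MN"},
]

QUASI_IDENTIFIERS = ("major", "cohort_year", "home_state")


def key_of(row, quasi_identifiers):
    """Project a row onto its quasi-identifier values."""
    return tuple(row[q] for q in quasi_identifiers)


def reidentify(deidentified, public, quasi_identifiers):
    """Map names to deidentified rows wherever the match is unambiguous."""
    names_by_key = defaultdict(list)
    for row in public:
        names_by_key[key_of(row, quasi_identifiers)].append(row["name"])

    rows_by_key = defaultdict(list)
    for row in deidentified:
        rows_by_key[key_of(row, quasi_identifiers)].append(row)

    matches = {}
    for key, rows in rows_by_key.items():
        names = names_by_key.get(key, [])
        if len(rows) == 1 and len(names) == 1:
            matches[names[0]] = rows[0]
    return matches


matches = reidentify(DEIDENTIFIED, PUBLIC, QUASI_IDENTIFIERS)
```

Only the unambiguous match is returned; the two biology majors shield each other. That shielding is the same group-size idea behind k-anonymity, and it is why small outlier groups are the first to fall.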

So I can’t get around it. If I’m to prepare students to take information privacy and security seriously, never mind enter actual infosec and privacy careers, I have to show them how to think like a Garbage Human (which is how I often phrase it in class), and I have to show them how to attack systems (writ large). How do I do this without turning them into Garbage Humans themselves?

This isn’t exactly a new problem in infosec, of course; the learn-to-defend-by-attacking paradox is the earth out of which Certified Ethical Hacker, CIP{M|T|P}, and similar tech-plus-thinking-about-law-and-ethics certifications grew. It’s not even a new problem generally—if we were to strip academe of everything that could be used to Garbage Human, how much of academe would be left? (Yes, yes, plenty of computer scientists claim that computer science would be left. Those computer scientists are wrong, wrong, wrong, wrong, wrong about that.)

What I ended up doing, because I felt more than a little bad about accepting the law-prof’s assignment idea so uncritically, was going back through my syllabus, assignments, and class slides looking for how I’d approached gray areas and put guardrails around students’ growing potential for Garbage Humanning. What I found fell into an actually rather small number of techniques:

  • Clearly and often laying out stuff that’s either illegal or so Garbage Humanny that it should be. For example, I use altering physical mail envelopes as an analogy to various address-spoofing attacks… but I also explicitly point out that mail tampering is amazingly illegal in the US and they shouldn’t do it. In person in the classroom, I am not at all shy about labeling certain practices Garbage Human territory.
  • Giving copious examples of how real people and organizations have been harmed by attack techniques. I can’t control whether my students use what I teach them to Garbage Human. I can control whether they can reasonably use the excuse “I didn’t know this could hurt anybody!” and I definitely try to.
  • When students in my class perform actual reconnaissance, attack, or forensics maneuvers, they’re doing it on me, on themselves (a good habit to get into! and certainly how I prep any assignment where they’ll be looking at me or my data), or on canned datasets created for the purpose (yes, I use the Greg Schardt/Mr. Evil dataset, for lack of one that’s more recent). They’re not doing it on unwitting and possibly-extra-vulnerable targets. Again, the techniques they’re learning absolutely can be repurposed for Garbage Humanning—but I’m clear that I don’t want them doing that, and I don’t give them any actual practice kicking down.
  • Keeping the emphasis on “attack to defend” throughout. They’re not learning adversarial thinking and attack techniques to turn into Garbage Humans, but to equip themselves to defend themselves, their loved ones, and those for whom they are in some way responsible against the depredations of Garbage Humans.
  • Being open about my own dilemmas vis-à-vis Garbage Humanning. For example, I am unbelievably tempted to pull a Narayanan-and-Shmatikov on the Minnesota learning-analytics dataset, the one from several Soria, Nackerud, et al. publications. Even though I don’t actually have that dataset (and don’t want it, good gravy, what a terrifying responsibility), I’d bet Large Sums of Money that knowing the cohort entry year (which, yes, they published) is enough all by itself to find some folks in the dataset via LinkedIn or a date-bracketed Google dork against the University of Minnesota’s website, and I might even be able to find some folks in their painfully-low-n outlier groups. Possible? Unequivocally. Absolutely without question possible. I’m not even good at reidentification and reconnaissance techniques and I am absolutely sure that I can do this. Ethical? … Well, that’s a tough one, which is why I haven’t actually done it.

Is this enough? I don’t know. I’m certainly still kicking the problem around in the back of my head, because if I can do better than I’m doing, I want to.