Behavior data vs. patron privacy

Productive discomfort

Someone is gathering every crumb you drop / These mindless decisions and moments you long forgot / Keep them all!
—Vienna Teng, “The Hymn of Acxiom”

I’ve finally dumped GMail forever.

Though the process took quite some time—moving mailing-list subscriptions, changing profiles on websites that knew me by my GMail address, extracting the messages I needed to keep, and similar chores—the relief of a little more freedom from Google’s privacy-invasive data mining has been well worth the trouble for me. I want as little as possible to do with a company that allegedly thinks trawling and keeping behavior-profile data from college students’ school-mandated, school-purchased email accounts without notice or consent is in some way ethical.

I bring this up because of a strong tension I noticed at the recent Library Technology Conference between library notions of privacy and academic libraries’ salutary desire to use various forms of patron behavior data to improve websites and other services. How much are we willing to snoop to get better at what we do? How do we gauge potential (not actual, let us pray) harm to patrons? When we do decide that snooping is worth the risks, how do we protect our patrons from data breaches (making the news at too many higher-education institutions of late) and reidentification attacks? How do we avoid participating in today’s sinister commercial and political nightmare of greedy, thoughtless, not-always-disclosed physical and digital surveillance? Does performing surveillance in our much-trusted libraries not legitimize the other surveillance regimes?

We cannot assume that the data we could and sometimes do gather about our patrons would be of no interest to the powerful or punitive. We know better, so we protect circulation records and computer-use histories as best we know how, and interpose our proxy servers and sign-in pages between snoopy electronic publishers and our patrons’ identities. We saw last year in the Aaron Swartz case the worst that can happen when we decline to interpose ourselves, and we also have good reason to be wary of privacy violations by providers of electronic content. An odd twist in the Georgia State e-reserves opinion lends point: some of the infringement claims were dismissed by the judge because access logs showed none of the students had actually downloaded the allegedly infringed-upon material. If this segment of the opinion holds up on appeal, it would seem to offer publishers holding copyrights in works used in higher-education classrooms tremendous incentive to examine data on student reading, and demand that institutions and their libraries gather and keep that data for them, in order to find grounds to sue us more.

At the same time, I certainly don’t want to paint data-gathering librarians with the same brush as Google, much less monumental consumer-behavior profiler Acxiom. Librarianship already has professional-ethics commitments regarding privacy that apply to data, which is a good start. I’m fond of principles III and VI of the Code of Ethics of the American Library Association, myself. Our data-collection motives are also rather purer, our data easier to tie to obvious patron benefits: sets of aggregate data, from COUNTER usage statistics to website access logs, are profoundly helpful for service refinement, website usability improvement, and collection development. Academic librarians don’t share these data (except distilled into harmless charts or tables), or aggregate them with other libraries’ data (except very carefully indeed), or mine them for individual identities, or keep them forever just in case, or willingly turn them over to businesses or government. If only other data-gatherers regularly behaved like libraries!

What we don’t seem to have yet is a profession-wide sense of how to apply our ethical commitment to privacy to digital information-behavior data, such as we can gather from website access logs, proxy-server logs, or web trackers placed in our websites or OPACs. We don’t to my knowledge have best-practice documents, charts and checklists, sample policies, or the rest of the mundane apparatus that helps us navigate other ethics questions without stopping in the middle of our busy days for ponderous pondering. (If I’m wrong about this, I would love to know more; please leave a comment correcting me.) I can only begin to imagine what this apparatus will come to look like, and I certainly can’t prescribe it from on high. It needs to be the fruit of a collective discussion. Fortunately, events like Library Technology Conference are starting that discussion.

At lunch on the second day of the conference, after my session on patron-computer privacy, a student at the library school where I teach asked me whether I approved of the systematic catalog usage tracking one presenter discussed at a session we had both attended. A level, don’t-you-dare-equivocate stare accompanied the question, an expression I dearly love to see on student faces because it demonstrates so clearly their willingness and ability to think critically about anything I or anyone else tells them. I sighed and said, “I wish they weren’t using Google Analytics.” That was easy to say; Google has repeatedly shown with Google Buzz, Google Plus, and various of its data-mining efforts that its notion of privacy does not measure up to library standards. Convenient though its tools undoubtedly are, privacy-conscious academic libraries should avoid them. (In my session, an attendee pointed out Piwik as a self-hosted and therefore less invasive Google Analytics alternative. Businesses and consortia that host library websites as a service would do well to offer Piwik to their clients.)

After that, though, I had to stop and think. I eventually said, “With the way they’re scrubbing data, it seems mostly okay to me, but I’d want to know more about their data-disposal schedule, and… I’d want them to feel uncomfortable about holding that data.”

It’s the last piece of that answer that I still stand behind. I want academic librarianship to feel uncomfortable about accumulating patron information-behavior data, even anonymized, even in aggregate. I want that discomfort to cause us not to collect patron information-behavior data at all without a clear need for it, to collect the scantiest data possible when it is needed, to guard that data well, and to throw it away like a hot potato as quickly as feasible to keep ourselves and others from the temptation to abuse it. I want us to endure the uncomfortable process of writing data-retention and data-privacy policies that treat patron privacy as a dominant concern. Data discomfort is productive, just as the tension at Library Technology Conference was. Productive data discomfort will help libraries remain an excellent example of consciously ethical privacy practices… an example much of the rest of society desperately needs just now.

This doesn’t mean we won’t ever collect data. This doesn’t mean we won’t ever keep data. This doesn’t mean we won’t ever use data. With luck, it means we will be careful enough about data collection, retention, and use to protect our patrons and keep their trust in us intact. No library patron should have to walk away from a library for the same reason I walked away from GMail.

I also believe that, as privacy watchdogs within our institutions, academic librarians need to cast a critical, privacy-minded eye over the student-analytics movement. InBloom, a would-be K-12 student profiler/tracker whose products and services I find decidedly creepy and intrusive, has been defeated for now by teachers, parents, and librarians, but I am still seeing course-management systems and student-records systems in higher education discussing or even implementing tracking measures without much heed paid to student privacy. That such dubious features may not presently work well and can be ignored—I saw no use whatever in the so-called analytics that turned up in the most recent upgrade of my campus’s course-management system—does not exempt us from questioning the collection and retention of student behavior data. Ideally, we should do so before tools based on that data develop enough to be both seductive and dangerous.

Oh, and because quite a few people ask whenever I tweet about leaving GMail: I’ve moved my professional non-work email (mailing lists and so on) to an email account on my own web domain; the actual mailserver is managed by the company I pay to host that domain. It’s working great so far.

Note: This post is copyright 2014 by Library Journal. Reposted under the terms of my author’s agreement, which permits me to “reuse the work in whole or in part in your own professional activities and subsequent writings.”