Thinking through RA21

“Recommended Practices for Improved Access to Institutionally-Provided Information Resources,” colloquially known as “RA21,” is open for public comment until the 17th of May. I definitely plan to comment, but this post is not that comment—this post is the plonking, pedantic process of figuring out and refining what I actually think so I can make a comment that is at least slightly more cogent than “ARGH, NO, CAN WE PLEASE NOT” or “Most mendaciously misleading specification title ever, y/n?”

You may have gathered from the above that I am not a fan of this spec. Quite right. I’m not. You know, cards on the table and all that.

One reason I am not a fan is the authorship list (see page 2 of the spec PDF), which is a bunch of big-pig publishers with a sprinkling of IT people and… (as best I can tell) exactly zero librarians, LIS faculty, or library or technology ethicists. (Peter Brantley works in a library but is not a librarian. Usually I wouldn’t care, but in this specific situation, I think it matters. I also think it matters that Brantley is a longtime participant in a project that has signally failed to get to grips with the staringly obvious truth that a tool that forces comments onto every website ever is a living invitation to harassment and abuse. Not, in my book, trustworthy on privacy or safety, and I can’t help wondering whether this is precisely why he landed on the authorship group.) How am I not supposed to suspect that the big pigs are pulling a fast one on privacy here? Especially given that the big pigs are presently a miasma of surveillance-capitalism FOMO?

(Seriously, this whole thing is giving me flashbacks to back in the day when the NIH held hearings on a possible open-access policy and only invited the big pigs. That was ridiculously regulatory-captured—though fortunately the NIH got over that—and so is this. Does NISO not notice when it’s being used as a patsy and a figleaf?)

So. One obvious threat model here: the big pigs are trying to use RA21 to accumulate saleable (or otherwise exploitable, perhaps via Facebook-style microtargeted advertising) data about individual library patrons’ reading, data they often can’t get now because proxy servers and IP ranges are in the way. Let’s take as read that aggregate use data is either not saleable/exploitable at all, or not nearly as lucrative to sell/exploit as use data where individuals are identified or reidentifiable.

Another threat model that springs to mind: individualized and/or reidentifiable surveillance of patron information use, likely in the name of “learning analytics” or “assessment.” The big pigs could want to do this. Institutions could want to do this. Unholy alliances between big pigs and institutions (e.g. the sordid roiling cauldron of Arizona State plus Cengage, or the blithely horrifying Unizin Consortium) could want to do this.

Quick reminder, academic librarians: ACRL says quite clearly and distinctly that “The privacy of library users is and must be inviolable. Policies should be in place that maintain confidentiality of library borrowing records and of other information relating to personal use of library information and services.” (Another of my long-brewing rants has to do with “confidentiality” as a word librarians doing Seriously Shady Stuff hide behind. Rant for another day, though.)

Um. Possibly I should back up and explain what RA21 even is first? Yes. I should do that.

RA21 and single sign-on (SSO)

So, I explained before how libraries deploy IP range recognition and proxy servers to get Dr. Scientist through publisher paywalls. RA21 wants to replace this constellation of systems—which, to be fair, is decidedly kludgy and maintenance-intensive—with SSO, which is fairly (though not universally) ubiquitous higher-ed IT infrastructure. If you work and/or study in higher ed, chances are good that whenever you use your institution’s login credentials, there’s an SSO server authenticating you and at least partially authorizing you to do things with, on, or to institutional systems.

The central ideas fueling SSO aren’t too hard to get one’s head around. Organizations that run an SSO server to authenticate their students and staff are “Identity Providers” or “IdPs.” An IdP knows and can share various facts about the people represented in it; these facts are “attributes,” or taken as a whole “metadata.” Then there are services that need to ask IdPs about people in the organization, often though not solely to authenticate them and authorize service use; these are “Service Providers” or “SPs.” SPs may be internal or external to the organization—that is, a campus system as well as a big-pig publisher could rely on the campus’s SSO server for authentication and authorization. For probably-obvious reasons, though, internal SPs tend to have broader access to more attributes than external SPs. Clear so far? Cool.
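To make the IdP/SP/attribute vocabulary concrete, here is a minimal sketch of an IdP deciding what to hand to an internal versus an external SP. The attribute names come from the common eduPerson vocabulary, but the person record, the two-tier release policy, and the function names are all made up for illustration; a real IdP's policy will differ.

```python
from dataclasses import dataclass, field

# Hypothetical record for one person at an IdP (values are invented).
PERSON = {
    "eduPersonPrincipalName": "dscientist@example.edu",
    "displayName": "Dr. Scientist",
    "eduPersonScopedAffiliation": "faculty@example.edu",
    "eduPersonEntitlement": "urn:mace:dir:entitlement:common-lib-terms",
}

# Assumed policy: internal SPs may see more attributes than external ones.
RELEASE_POLICY = {
    "internal": {
        "eduPersonPrincipalName",
        "displayName",
        "eduPersonScopedAffiliation",
        "eduPersonEntitlement",
    },
    "external": {"eduPersonScopedAffiliation", "eduPersonEntitlement"},
}


@dataclass
class ServiceProvider:
    name: str
    internal: bool
    requested: set = field(default_factory=set)


def release_attributes(person, sp):
    """Return only the attributes the SP asked for AND the policy allows."""
    allowed = RELEASE_POLICY["internal" if sp.internal else "external"]
    return {k: v for k, v in person.items() if k in sp.requested and k in allowed}


wanted = {"eduPersonPrincipalName", "eduPersonEntitlement"}
campus_lms = ServiceProvider("campus-lms", internal=True, requested=wanted)
publisher = ServiceProvider("big-pig-publisher", internal=False, requested=wanted)

print(release_attributes(PERSON, campus_lms))
print(release_attributes(PERSON, publisher))
```

Under this assumed policy the publisher asks for a personal identifier and does not get it. The catch is that the policy table is just configuration, and whoever controls it can change it.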

What attributes do SSO servers typically store and share on people? This varies by IdP, and can be shockingly hard to find good information about. (One of the reasons this post has taken so long is precisely the time involved in trying to back-of-the-envelope calculate a baseline for attributes across IdPs. I’m posting now partly because of NISO’s RA21 feedback deadline and partly because I’ve given up on using OSINT to figure out how IdPs handle attributes. It’ll take somebody like Educause running a survey.) Dear campus-IT units everywhere: be transparent about this, would you please?

Brown University has the clearest, completest attribute rundown I’ve yet seen. (Dear campus-IT units everywhere: imitate Brown, please.) A few things to notice here:

  • Several attributes are actual personal identifiers. Others constitute personally-identifiable information. Still more are indirect identifiers—combined, they could pick someone out of the pack even without an actual personal identifier. (Has anyone run a study on how k-anonymous the entire list of someone’s eduPersonEntitlements or isMemberOfs is? I’d love to know.)
  • External SPs—such as our big-pig publishers—can request that Brown hand them at least one personal identifier (eppnPersonalName).
  • The exact circumstances or process under which Brown would grant an external SP access to this identifier are not stated. (I looked but didn’t find anything. I may admittedly have missed something, or the documentation for this may be on a Brown intranet.) If there’s a generic recommended process and data-governance scheme for this, I haven’t found that either.
  • The supposedly-anonymized personal identifier (persistent-id) is unchanging, which allows an external SP to build up enough of a behavior-trail or web-tracker dossier over time to reidentify someone. (John Mark Ockerbloom has already pointed this out to NISO. Yay! That means I don’t have to!)
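Two of the points above can be sketched in a few lines: how an unchanging pseudonym lets an SP pile up a reading history, and what "how k-anonymous is a membership list" would mean computationally. Everything here is an assumption for illustration: the HMAC construction is one plausible way to derive a stable per-SP pseudonym, not Brown's or any specific IdP's method, and the secret, names, and log entries are invented.

```python
import hashlib
import hmac
from collections import Counter, defaultdict

IDP_SECRET = b"hypothetical-idp-secret"  # illustrative only


def persistent_id(user, sp_entity_id):
    """Derive a pseudonym that never changes for a given (user, SP) pair."""
    msg = f"{user}|{sp_entity_id}".encode()
    return hmac.new(IDP_SECRET, msg, hashlib.sha256).hexdigest()[:16]


def build_dossiers(access_log):
    """Group an SP's access log by whatever identifier the IdP sent."""
    dossiers = defaultdict(list)
    for pid, item in access_log:
        dossiers[pid].append(item)
    return dict(dossiers)


def k_anonymity(memberships):
    """Size of the smallest group of people with identical membership sets."""
    counts = Counter(frozenset(groups) for groups in memberships.values())
    return min(counts.values())


# The same pseudonym shows up on every visit, so visits are linkable.
pid = persistent_id("dscientist", "https://publisher.example/sp")
log = [(pid, "article-1"), (pid, "article-2"), (pid, "article-3")]
print(build_dossiers(log))

# Toy membership lists: one person has a unique combination, so k is 1.
toy_memberships = {
    "person-1": ["faculty", "chem-dept"],
    "person-2": ["chem-dept", "faculty"],
    "person-3": ["faculty", "history-dept"],
}
print(k_anonymity(toy_memberships))
```

The pseudonym hides the name but not the trail. Once the trail is long enough, or the membership list is unique enough, the name is recoverable from other sources.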

So could my RA21 threat models work?

Going by the RA21 spec, how could big pigs pile up a saleable/exploitable dataset of individuals and their information use? Well, trackers on their websites, of course, but that’s out of scope for RA21 so let’s leave it aside. Could they do it if they con institutions into helping? Perhaps in the name of the Library Value Agenda or similar?

On page 8, the fourth desideratum for RA21 reads: “The system will achieve end-to-end traceability, providing a robust, widely adopted mechanism for detecting fraud that occurs at institutions, vendor systems, and publishing platforms.” Unless the sole imagined fraud prevention method here is cutting off Dr. Scientist’s unpermissioned literature text-mining project without any further attempt to locate or punish her—which I would be okay with, actually—that suggests that on some level and for some time period, patrons’ information use must be traceable back to them by someone. On what level, for what time period, and by whom? And does RA21 allow the information necessary for fraud prevention to be (mis)used for other reasons, say, Big Pigs wanting to sell or exploit it? Who knows? RA21 doesn’t say. It damned well should.

The real meat is in section 2.3 starting on page 14. Line 1 column 2 of the table (minus the headers, you persnickety people) reads “Unless the SP has a specific, contractual agreement with an IdP, the IdP should only send anonymous and pseudonymous identifiers to the SP.” There. There. There is the incredibly giant loophole the big pigs will just waltz right on through. All SPs have to do to snoop individual patrons’ information use is convince IdPs to let them.

We trust IdPs implicitly… why exactly? Dear NISO: It’s called “insider threat.” Look it up. Then add it to your hilariously incomplete security/privacy threat model.

If I had better vidding skills than I do, I’d voiceover a scene from Disney’s Little Mermaid:

Big Pig: Come in. Come in, my librarian. We mustn’t lurk in doorways—it’s rude. One might question your upbringing… Now, then. You’re here because you have a thing for assessment. This, er, learning analytics doohickey. Not that I blame you—it is quite a temptation, isn’t it? Well, awed vocationist, the solution to your problem is simple. The only way to get what you want—is to send me an identifier attribute and ask me to collect the identified data for you.

Librarian: You can do that?!

Big Pig: My dear, sweet librarian. That’s what I do—it’s what I live for. To help underappreciated librarians like yourself. Poor souls with no one else to turn to.

I’ll spare you the actual filk of the song. (If you want to do it, though, go right ahead!)

It gets worse. Proxy servers and IP ranges are generally handled by the library, often with an assist from campus IT. SSO servers, however, are the domain of campus IT, with zero input whatever from the library. Campus IT does not subscribe to library ethics. Campus IT has different pressures on it than the library—most provosts, chancellors, and deans know better than to openly force the library to compromise privacy; the same is not true of campus IT (which often enough doesn’t give a crap about privacy anyway unless forced to). So the Big Pigs can probably have whatever identifiable data they want just by hinting to campus IT or campus administrators that it’d be useful assessment data to have, and wouldn’t you darling people just love to request it from us? All you need to do is send us the identifier attribute that makes it possible!

Not, frankly, that many academic libraries would resist temptation all that much better.

RA21 must put a stop to this scenario if it is to conform to library privacy norms. IdPs must only send identifier attributes in cases of suspected fraud, and when the case is closed, they must stop sending those attributes. SPs must only request identifier attributes in cases of suspected fraud, and when the case is closed, they must cease the request and discard all identified data. Furthermore, IdPs must never grant SPs access to attributes without the library’s explicit approval, and the library may only ever grant such approval for identifier attributes (or indirect identifiers in excess of what is absolutely necessary) in cases of suspected fraud.
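The rule in the paragraph above is simple enough to write down as a release check. This is a sketch of my proposal, not anything in the RA21 spec; the set of identifier attributes, the data structures, and the function name are assumptions.

```python
# Assumed classification of which attributes directly identify a person.
IDENTIFIER_ATTRIBUTES = {"eduPersonPrincipalName", "mail", "displayName"}


def may_release(attribute, sp_name, library_approved, open_fraud_cases):
    """Decide whether an IdP may send one attribute to one SP.

    library_approved: dict mapping SP name -> set of attributes the library
        has explicitly signed off on for that SP.
    open_fraud_cases: set of SP names with a currently open fraud case.
    """
    # Nothing goes out without the library's explicit approval.
    if attribute not in library_approved.get(sp_name, set()):
        return False
    # Identifier attributes additionally require an open fraud case.
    if attribute in IDENTIFIER_ATTRIBUTES:
        return sp_name in open_fraud_cases
    return True


approved = {"publisher": {"eduPersonEntitlement", "eduPersonPrincipalName"}}

print(may_release("eduPersonEntitlement", "publisher", approved, set()))
print(may_release("eduPersonPrincipalName", "publisher", approved, set()))
print(may_release("eduPersonPrincipalName", "publisher", approved, {"publisher"}))
```

Closing the fraud case removes the SP from `open_fraud_cases`, at which point the identifier stops flowing. The SP's obligation to discard what it already collected is a contract matter that no IdP-side check can enforce.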

Or, to boil it down to its essentials, try this: Neither IdPs nor SPs may sell, share, or use identifiable data, nor may they attempt reidentification of deidentified data, for any reason except as explicitly stated in (what had better be a new and very greatly improved) RA21. I mean, I suspect this to be a pipe dream; NISO hasn’t the guts to tell assessment and learning analytics to take a long walk off a short pier, and the Big Pigs clearly see assessment and learning analytics as an easy lockpick for the identified-data store. But I can dream, can’t I?

Line 2, column 2 expands on this theme: “In addition to the attributes provided by academic libraries, an additional attribute may be sent to the SP to support specific granular usage analysis or charge back requirements that the IdPs may have.” What does “granular” mean exactly? Who in the IdP may initiate such a request, for what reasons? What else can the SP do with this attribute? Who knows? RA21 doesn’t say. It damned well should.

I do approve of the last line on page 15: “In the case that the IdP sends more attributes than the minimal request from the SP, the SP must not collect or store that data under any circumstance.” Keep that. It’s not enough without a definition of “minimal request,” though—a Big Pig could perfectly well ask for every available attribute and call that minimal.
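Here is a minimal sketch of what that last line on page 15 asks of an SP, and of why it is toothless without a definition of "minimal request." The attribute names and values are invented, and the function is hypothetical.

```python
def store_minimal(received, minimal_request):
    """Keep only attributes named in the SP's own 'minimal request'."""
    return {k: v for k, v in received.items() if k in minimal_request}


# What an over-generous IdP happens to send (invented values).
received = {
    "eduPersonEntitlement": "urn:mace:dir:entitlement:common-lib-terms",
    "eduPersonPrincipalName": "dscientist@example.edu",
    "displayName": "Dr. Scientist",
}

# A well-behaved SP that only asked for the entitlement drops the rest.
modest = store_minimal(received, {"eduPersonEntitlement"})

# A Big Pig that declares everything "minimal" drops nothing at all.
greedy = store_minimal(received, set(received))

print(modest)
print(greedy)
```

The filter is only as good as the set passed into it, which is why the spec needs to say who gets to define that set.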

The rest of the spec is devoted to questions of UX. There isn’t a single additional word about privacy or security that I saw. So there we are. Threat model validated!

NISO, spec authors, the RA21 spec should not be approved, much less implemented, without vastly better data-governance protection for patrons against insider threat from IdPs and SPs than it currently contains. Fix it, please.

Teaching adversarial thinking

In case you missed it: A couple months ago a law prof brought on the wrath of academic Twitter by suggesting that students spend a week eavesdropping on the conversations of others to listen for people betraying their own security and privacy, a thing that people quite commonly do. Some of academic Twitter (self included) was initially entranced, until other parts of academic Twitter asked whether casual snoops (or even not-casual snoops) were really an okay thing to turn our students into? Especially when many of our students are still so unaware of the workings of privilege, such that snooping can take on exceptionally sinister overtones when applied to certain populations?

So the initially-entranced folks, self included, backed off our initial enthusiasm, and the furor seems to have mostly died down. I, however, am still stuck with a pedagogical problem: as an instructor in introductory information security, I actually do have to teach people to snoop on, and even attack the privacy and security of, other people and systems. I know that sounds horrifying. I know it does! And it definitely gets into some pretty dark-gray gray areas. But stick with me just a bit longer while I explain.

Over a longish period of information-security work, it’s become clear that the only way to have any confidence at all that a system (in the large sense, so not just “a technological system” but “a technosocial system, emphatically including the people involved or even enmeshed in it”) is secure or private (not, of course, the same thing) is to test it by attacking it:

  • To test whether deidentification suffices to anonymize a dataset (spoiler: it rarely if ever does), researchers try to reidentify one or more people in it, often using additional available data to test reidentification via dataset overlap. See, for example, the Narayanan and Shmatikov paper that doomed the Netflix recommender-system contest.
  • To test the security of a given piece of software, you ultimately try to break it. Yes, there are tools (e.g. “Google dorks,” “vulnerability scanners,” “fuzzers,” even Shodan) to locate obvious or common problems, but they’re not enough. A creative, lateral-thinking human being is much better at finding exploitable holes in code than a computer.
  • To prioritize and test for holes in systems (again, “system” writ large), you first think like an adversary—what are the crown jewels in this system, and how would someone who wants them attack the system? This is called “threat modeling,” and thinking-like-an-adversary is a crucial part of it; without that, you end up with what Bruce Schneier calls “movie-plot threats” while ignoring the gaping system problems right under your nose (as, for example, Equifax certainly did). A crucial insight in threat modeling, captured concisely in this xkcd cartoon, is that your enemies always attack with the simplest method likely to work.
  • And once you have your threat model, you test how well your system resists it by, well, attacking your system in the ways you have identified it to be potentially vulnerable! This often happens in the form of “penetration testing,” which can be done on physical systems, social systems, technological systems (such as networks or software), or any combination of the three. My favorite example of a pentest that goes after all three types of system is this absolutely astounding Twitter thread, which I use in my intro course, and after which I named the class’s messing-around server “Jek.”
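The first bullet's "reidentification via dataset overlap" can be shown on toy data. This is a bare exact-match linkage sketch, which is much cruder than what Narayanan and Shmatikov actually did with the Netflix data. Every record, name, and column below is invented.

```python
def reidentify(deidentified, auxiliary, quasi_identifiers):
    """Link 'anonymous' rows to named rows that share all quasi-identifiers.

    Only unambiguous links (exactly one named candidate) are returned.
    """
    matches = {}
    for row in deidentified:
        key = tuple(row[q] for q in quasi_identifiers)
        candidates = [
            aux["name"]
            for aux in auxiliary
            if tuple(aux[q] for q in quasi_identifiers) == key
        ]
        if len(candidates) == 1:
            matches[row["record_id"]] = candidates[0]
    return matches


# A "deidentified" release: names removed, indirect identifiers left in.
DEIDENTIFIED = [
    {"record_id": "r1", "cohort_year": 2011, "major": "Chemistry", "zip3": "537"},
    {"record_id": "r2", "cohort_year": 2011, "major": "History", "zip3": "537"},
    {"record_id": "r3", "cohort_year": 2012, "major": "History", "zip3": "554"},
]

# Auxiliary data an attacker could plausibly assemble from public profiles.
AUXILIARY = [
    {"name": "Person A", "cohort_year": 2011, "major": "Chemistry", "zip3": "537"},
    {"name": "Person B", "cohort_year": 2011, "major": "History", "zip3": "537"},
    {"name": "Person C", "cohort_year": 2011, "major": "History", "zip3": "537"},
    {"name": "Person D", "cohort_year": 2012, "major": "History", "zip3": "554"},
]

linked = reidentify(DEIDENTIFIED, AUXILIARY, ["cohort_year", "major", "zip3"])
print(linked)
```

Record r2 survives only because two people happen to share its combination of values. Small cohorts and outlier groups get no such cover.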

So I can’t get around it. If I’m to prepare students to take information privacy and security seriously, never mind enter actual infosec and privacy careers, I have to show them how to think like a Garbage Human (which is how I often phrase it in class), and I have to show them how to attack systems (writ large). How do I do this without turning them into Garbage Humans themselves?

This isn’t exactly a new problem in infosec, of course; the learn-to-defend-by-attacking paradox is the earth out of which Certified Ethical Hacker, CIP{M|T|P}, and similar tech-plus-thinking-about-law-and-ethics certifications grew. It’s not even a new problem generally—if we were to strip academe of everything that could be used to Garbage Human, how much of academe would be left? (Yes, yes, plenty of computer scientists claim that computer science would be left. Those computer scientists are wrong, wrong, wrong, wrong, wrong about that.)

What I ended up doing, because I felt more than a little bad about accepting the law-prof’s assignment idea so uncritically, was going back through my syllabus, assignments, and class slides looking for how I’d approached gray areas and put guardrails around students’ growing potential for Garbage Humanning. What I found fell into an actually rather small number of techniques:

  • Clearly and often laying out stuff that’s either illegal or so Garbage Humanny that it should be. For example, I use altering physical mail envelopes as an analogy to various address-spoofing attacks… but I also explicitly point out that mail tampering is amazingly illegal in the US and they shouldn’t do it. In person in the classroom, I am not at all shy about labeling certain practices Garbage Human territory.
  • Giving copious examples of how real people and organizations have been harmed by attack techniques. I can’t control whether my students use what I teach them to Garbage Human. I can control whether they can reasonably use the excuse “I didn’t know this could hurt anybody!” and I definitely try to.
  • When students in my class perform actual reconnaissance, attack, or forensics maneuvers, they’re doing it on me, on themselves (a good habit to get into! and certainly how I prep any assignment where they’ll be looking at me or my data), or on canned datasets created for the purpose (yes, I use the Greg Schardt/Mr. Evil dataset, for lack of one that’s more recent). They’re not doing it on unwitting and possibly-extra-vulnerable targets. Again, the techniques they’re learning absolutely can be repurposed for Garbage Humanning—but I’m clear that I don’t want them doing that, and I don’t give them any actual practice kicking down.
  • Keeping the emphasis on “attack to defend” throughout. They’re not learning adversarial thinking and attack techniques to turn into Garbage Humans, but to equip themselves to defend themselves, their loved ones, and those for whom they are in some way responsible against the depredations of Garbage Humans.
  • Being open about my own dilemmas vis-à-vis Garbage Humanning. For example, I am unbelievably tempted to pull a Narayanan-and-Shmatikov on the Minnesota learning-analytics dataset, the one from several Soria, Nackerud, et al. publications. Even though I don’t actually have that dataset (and don’t want it, good gravy, what a terrifying responsibility), I’d bet Large Sums of Money that knowing the cohort entry year (which, yes, they published) is enough all by itself to find some folks in the dataset via LinkedIn or a date-bracketed Google dork against the University of Minnesota’s website, and I might even be able to find some folks in their painfully-low-n outlier groups. Possible? Unequivocally. Absolutely without question possible. I’m not even good at reidentification and reconnaissance techniques and I am absolutely sure that I can do this. Ethical? … Well, that’s a tough one, which is why I haven’t actually done it.

Is this enough? I don’t know. I’m certainly still kicking the problem around in the back of my head, because if I can do better than I’m doing, I want to.


I started this a while ago, but didn’t have anywhere to share it. California’s tussle with Elsevier gave me reason to update it, so here you go. By all means hear it in your favorite singer’s voice. I favor Rita Moreno, myself.

Never know how much I’d read you
Never know how much I’d click
When you straight-up overcharge me
I get a fever makes my budget sick

(chorus below)
You give me fever when you dis me
Fever when you NDA
Fever!… when you Big Pigs
Make me sign the library away

Journal Big Deals in science
Ebooks the humanities
We do their work when they call our name
They still make us pay APCs


Germany has got the fever
Got it from V-S-N-U
Fever isn’t such a rare thing
Sweden’s gone and got it too

(instrumental interlude, for a change)

All those California budgets
Shattered by the NPG
Elsevier tried double-dipping
But they couldn’t get past old UC


Now you’ve listened to my story
Here’s the point of my tirade
Pigs exist to give you fever
To make you broke and lonely and afraid!

They give you fever when they dis you
Fever if you live and learn
Fever when they piss you
Off—an awful way to burn
What an awful way to burn
What an awful way to burn
What an awful way to burn…

Right now bits and pieces of a “Data Breach Tango” are rattling around in the back of my head. We had it comin’! We had it comin’! We had it comin’ all the time…

Libraries no longer guard patron privacy.

It’s National Library Week, and the American Library Association is marking the occasion with a triumphalist paean to libraries. I don’t mind that in and of itself; triumphalism on libraries’ behalf is part of ALA’s job. But so is truth. And there’s a chunk of this paean that’s not true—not just not true, but importantly false.

Stephen Francoeur on Twitter noticed a contradiction between two squares in the self-congratulatory table on the bottom of page 7 (image used by permission):

[Image: six-cell table bragging on libraries, with the middle bottom cell altered to point out a contradiction in two top-row cells]

Francoeur is not wrong, of course—libraries can’t both provide a refuge from surveillance and routinely surveil students to (purportedly) establish links between library use and student success. Shame on ALA for not so much as noticing the contradiction there.

My irritation goes deeper, however, and centers on that top right-hand block, which says: “Librarians have long championed their community members’ right to access information privately, and serve as an essential refuge where everyone can check out materials or browse the internet without their information being shared.”

I’m sorry, but the second half of that sentence is flatly untrue. It is a whopper so gargantuan that if whoppers still enlarged noses, the noses of the people who wrote it would span the ocean and we could hang a transoceanic Internet cable off them. American libraries and librarians cannot truthfully claim this any longer, if indeed they ever could. Let me count just a few ways in which libraries do not prevent—and sometimes even foster—information/data sharing about library-information use and Internet browsing.

Abysmal web-privacy hygiene

Sniffing network traffic is so simple I teach my Intro to Information Security students to do it—and this is a class with zero technical prerequisites. If you’d like to try it yourself—do not do this on a public network, though; that’s snooping on people, which is Not Okay—download Wireshark, fire it up, then:

  1. go to the Capture menu and select Options
  2. make sure the “Enable promiscuous mode” box is checked, then OK out of the options menu
  3. type tcp port http into the capture-filter box at bottom (or select it from the handy dropdown menu), and
  4. double-click (most likely) “en0” for the local wireless network.

Congratulations; you’re sniffing unencrypted web traffic. Hit the red-square button when you’d like to stop doing so. If you actually look at what you get when you do this, you’ll notice that it’s the entirety of whatever pages are being browsed, by whoever is on the network. (Unless someone is using a VPN, granted, but if they are, Wireshark shouldn’t be picking up details of their traffic at all.)
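To see why "the entirety of whatever pages are being browsed" is no exaggeration, here is a small sketch of what a sniffer can read out of a single unencrypted HTTP request. It captures nothing: it only parses a made-up sample of the bytes that would cross the network, and the hostname, search query, and cookie are invented.

```python
def summarize_http_request(raw: bytes):
    """Pull the readable parts out of a plaintext HTTP/1.x request."""
    text = raw.decode("iso-8859-1")
    head, _, _body = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, path, _version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "method": method,
        "host": headers.get("host"),
        "path": path,
        "cookie": headers.get("cookie"),
    }


# Invented example of one catalog search as sent over plain HTTP.
SAMPLE = (
    b"GET /search?q=bankruptcy+help HTTP/1.1\r\n"
    b"Host: library.example.org\r\n"
    b"Cookie: session=abc123\r\n"
    b"\r\n"
)

seen = summarize_http_request(SAMPLE)
print(seen)
```

Over HTTPS those same bytes are encrypted before they leave the browser, so a parser like this gets nothing useful out of a capture.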

The corollary is inescapable: libraries that have not moved their websites to HTTPS—encrypted them—are allowing random Wiresharkers anywhere to snoop on anyone using their websites, not to mention nosy ISPs, ad-mad airlines, and any number of other creeps. Essential refuge? Not even close.

Academic libraries, for the most part, moved to HTTPS fairly quickly once Let’s Encrypt made it less costly and Google started penalizing insecure pages in search results. American public libraries are horrifyingly behindhand, however. At a presentation last October for the Minnesota Library Association, I demonstrated that of the dozen public-library consortia in Minnesota, only half had fully-encrypted websites (with one more working on it). When I redid that work for what was to be an encore webinar in Wisconsin, I found that of our sixteen public-library systems, only six have fully-encrypted websites (with one more working on it). I started doing this work for other states too, but I am as usual overcommitted, so I can’t finish it. If you can, please do! The crowdsourcing spreadsheet I was building is here.

Neither public nor academic libraries make much if any effort to prevent third-party tracking of Internet use from library computers, never mind library-provided wifi. I don’t know of any libraries that routinely have a tracker-blocking plugin installed in library-computer web browsers. (If your library does, do tell me! I will be delighted!) Dealing with wifi is admittedly harder, but the Pi-Hole project demonstrates that it is at least possible. The benefits would not be limited to improved patron privacy; they would also include vastly lessened bandwidth use and a faster, more pleasant user experience for people browsing on mobile devices.

I don’t even know of any libraries that switch the default search engine on library computers from Google to DuckDuckGo. This should be an absolute no-brainer! Google is so evil they’re lobbying against anti-audio-surveillance laws!

Incidentally, if you’re wondering how I can possibly be so blunt about this, let me explain. My Minnesota talk went over like a lead balloon. I rather expected that, having tilted at windmills before, but I was honestly surprised that the issue seems to be that I made specific critiques of specific libraries and library systems. Not cricket, apparently, even when the critiques themselves are fully legitimate. Are we that fragile in librarianship? That unwilling to be confronted with our own weaknesses? Well, that’s not good.

When the Wisconsin Library Service asked me to reprise the talk as a webinar as a favor to them, I agreed. I was even more surprised to receive an email from them a week before the webinar date saying that a Minnesota librarian had told them about my earlier talk fiasco, and asking that I remove from the webinar any and all references to specific Wisconsin libraries and library systems. Since I don’t generally hold with censorship, I refused and canceled the webinar.

So I have most of the Upper Midwest disgusted with me already. It can’t get much worse, which oddly enough confers a certain immunity. I might as well try to use that for good.

Anyway, if blocking trackers is an all-around privacy and UX win, why on earth aren’t libraries doing it?

Trackers, trackers everywhere

Because libraries use web trackers, of course. Usability/UX trackers with horrendous privacy-demolishing terms-of-service agreements, like NewRelic or Ad Nexus. Actual marketing trackers—I will not even consider joining the Wisconsin Library Association until Multiview trackers disappear from its website and WLA has the grace to apologize to all of us, members and non- alike, for Multiview ever being there. Google Analytics, ubiquitous in libraries as everywhere else. Even Doubleclick—which is one of the Foul Horsemen of the Adpocalypse—when libraries aren’t careful about how they embed YouTube videos into websites. (Also, YouTube? Even leaving tracking questions aside, YouTube is the actual worst.)

Libraries. Routinely track. Their website visitors’. Website usage. Essential refuge? Try “willing collaborator in surveillance capitalism,” in the Vidkun Quisling sense of the word “collaborator.”

The e-resource vendors from whom libraries purchase access also routinely track patron information use, and are trying to do so more commonly and more identifiably. For a quick explainer around this in the academic-library context, try Cody Hanson’s CNI talk slides from earlier this week. Public libraries are no better; EBSCO, for example, is a wilderness of trackers. What do libraries do about this? Presently nothing whatever, save for the occasional windy grandiose manifesto with no accompanying procedures or implementation plans—empty talk, in other words. Libraries do not forbid vendor tracking of patrons in their content licenses. Libraries do not prevent vendor tracking via their systems.

I mentioned a tracking hygiene issue specific to academic libraries in a prior post: proxy-server logs, which inevitably track and retain identified records of patron information use, are not routinely deleted. Essential refuge my size-11 foot.

Surveillance in assessment practices

But web tracking is just a special case of a more general practice: assessment by surveillance. From the hideous edtech surveillance monsters attacking K-12 education to “customer relationship management” surveillance in public libraries to library learning analytics in higher education—it’s all surveillance, and it all includes surveillance of patron information use.

I’m tired and this rant is long enough already, so I’ll save a longer dissection of assessment surveillance for another day. Suffice to say that just in the realm of library learning analytics, the surveillance rises to toweringly creepy and horrifying levels without even the slightest nod toward consent practices, ethical review is inadequate where it exists at all, and data hygiene and deidentification practices are frankly obscene. (I will at some point emit another rant about deidentification vs. anonymization and why the latter is an utter impossibility—but I’m tired and this rant is long enough already.)

Essential refuge? No refuge at all.


American libraries no longer guard patron privacy. We can do so again if we choose, but it will take quite a bit of work, not to mention considerably more self-restraint faced with the vastly-increased ease of digital surveillance than we have yet shown.

Until then, we need not to lie about it. Lying is a Facebook trick, as I described at length in a talk I gave yesterday for the local IT Leadership Conference (and promised to post here, so now I have). We should be above lying, we who prize truth.

Unizin Not-Common-Knowledge Data Model

I’m doing a talk this Tuesday for a campus IT conference. Should be a good time, for certain values of that phrase. I’ll post a link to the slides here afterwards.

While writing the talk—I’m one of those dorks who does script out talks word-for-word, though I do it in the slangy, choppy rhetorical style I actually talk in; academese is not my speech register and I don’t pretend it is—I ran across the Unizin Common Data Model, which if I understand correctly underlies the giant data warehouse for student data called the Unizin Data Platform. This will hold data from all Unizin member institutions.

To Unizin’s credit, they have a data dictionary publicly available, though every time I’ve tried to get just the table listing (or ERD?) it hasn’t worked. Still, the list of column/attribute names is there, and this list is a swift and daunting education in student Big Data.

See for yourself by all means, but here are some specific areas of the table I suggest looking at:

  • Everything from the Incident and IncidentPerson tables (conveniently, the table name is the first column in the data dictionary and is how the dictionary is ordered), especially the RelatedToDisabilityManifestationInd column
  • the LearnerAction and LearnerActivity tables, noting for the record that hashing the LearnerID is not anything like a sufficient privacy guarantee
  • the Person table and related tables, which are detailed to an extent that gives me nightmares
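On the hashing point: here is a minimal sketch of why a hashed ID is weak protection when the space of possible IDs is small. I don't know how Unizin actually hashes LearnerID, so the unsalted SHA-256 and the seven-digit ID format below are assumptions made purely for illustration.

```python
import hashlib


def hash_id(learner_id: str) -> str:
    """Assumed scheme: plain, unsalted SHA-256 of the ID string."""
    return hashlib.sha256(learner_id.encode()).hexdigest()


def build_lookup(candidate_ids):
    """Precompute hash -> ID for every ID an attacker considers possible."""
    return {hash_id(candidate): candidate for candidate in candidate_ids}


# Toy ID space: seven-digit numbers, of which we enumerate the first 100,000.
lookup = build_lookup(f"{n:07d}" for n in range(100_000))

# A "protected" value as it might appear in a data warehouse row.
hashed_value = hash_id("0042317")

# Reversing it is a dictionary lookup, not cryptanalysis.
print(lookup[hashed_value])
```

A salted or keyed hash closes off this particular attack, but the result is still a stable pseudonym that ties every row about one learner together, which is where reidentification from the rest of the record starts.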

Have fun asking yourself why on earth a learning-management system needs to know all this… and considering the Equifax-level horror if there is ever a breach in it.