Minimizing Disclosure Risk in HHS Open Data Initiatives. 1. Re-identification of Individuals in Data Released to the Public


There are exceedingly few documented instances of the re-identification of individual persons in datasets that have been released to the public. None has involved a sample survey or a federal government database, and few have involved data that were protected by methods that would be considered rigorous by today’s standards.

The most famous re-identification, which predated the HIPAA Privacy Rule and influenced its development, was Latanya Sweeney’s 1996 re-identification of a substantial fraction of the records in a database of Massachusetts state employees discharged from hospitals (Cavoukian and Castro 2014). The records had been de-identified by removing names, Social Security numbers, health insurance IDs, hospital names, doctors’ names, and other obvious identifiers, but ZIP code, sex, and date of birth had been retained because of their analytic value. Sweeney matched the values of these three variables against a city voter registration list that was purchased for a nominal fee. The employees who were re-identified included the state’s governor. With this work Sweeney (1997) demonstrated that the removal of explicit identifiers does not guarantee that records are anonymous, that is, unable to be associated with individual persons. Underscoring this point, Sweeney (2000) estimated from 1990 census data that 87 percent of the U.S. population could be uniquely identified by the combination of 5-digit ZIP code, date of birth, and gender.[14]
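
The linkage described above can be illustrated with a minimal sketch. It is not Sweeney’s actual procedure or data: the records, the field names (`zip`, `sex`, `dob`, `name`, `diagnosis`), and the helper functions are all hypothetical, and the matching rule (an exact match on a combination that is unique in both files) is an assumption chosen for simplicity. The second function shows one simple way to measure the share of records that are unique on the retained variables, which is the kind of quantity behind the 87 percent estimate.

```python
from collections import Counter

# Variables retained in the hypothetical "de-identified" release.
QUASI_IDENTIFIERS = ("zip", "sex", "dob")


def key(record):
    """Return the record's combination of quasi-identifier values."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link_records(released, identified):
    """Pair released records with names from an identified file.

    A pair is reported only when the quasi-identifier combination
    occurs exactly once in each file (an assumed, conservative rule).
    """
    released_counts = Counter(key(r) for r in released)
    identified_counts = Counter(key(r) for r in identified)
    names = {key(r): r["name"] for r in identified}
    matches = []
    for record in released:
        k = key(record)
        if released_counts[k] == 1 and identified_counts.get(k) == 1:
            matches.append((record, names[k]))
    return matches


def fraction_unique(records):
    """Share of records whose quasi-identifier combination is unique."""
    if not records:
        return 0.0
    counts = Counter(key(r) for r in records)
    return sum(1 for r in records if counts[key(r)] == 1) / len(records)


# Invented toy data; no real persons or records are represented.
released = [
    {"zip": "02138", "sex": "M", "dob": "1950-01-15", "diagnosis": "condition X"},
    {"zip": "02139", "sex": "F", "dob": "1962-07-04", "diagnosis": "condition Y"},
    {"zip": "02139", "sex": "F", "dob": "1962-07-04", "diagnosis": "condition Z"},
    {"zip": "02140", "sex": "M", "dob": "1975-03-22", "diagnosis": "condition W"},
]
voter_list = [
    {"name": "Person A", "zip": "02138", "sex": "M", "dob": "1950-01-15"},
    {"name": "Person B", "zip": "02139", "sex": "F", "dob": "1962-07-04"},
    {"name": "Person C", "zip": "02141", "sex": "F", "dob": "1980-11-30"},
]

matches = link_records(released, voter_list)
```

In this toy example only the first released record is linked to a name. The two records that share a ZIP code, sex, and date of birth protect each other, and the fourth has no counterpart in the identified file. This is why coarsening such variables (for example, releasing year of birth rather than full date of birth) reduces the risk of this kind of linkage.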

El Emam et al. (2011) conducted a systematic review of known re-identification attacks on health data and other types of data. The review uncovered 14 re-identification attacks in which at least one individual was accurately re-identified. Of the 14 examples, 11 were conducted by researchers solely to demonstrate or evaluate the risk of re-identification. Notably, only 2 of the 14 involved databases that were protected in accordance with current standards. One of the two was a health database, consisting of records from a regional hospital that were de-identified under the Safe Harbor method of the HIPAA Privacy Rule, and the rate of re-identification was found to be very low (just 0.022 percent, representing two persons) despite strong assumptions about what an intruder might know (see Kwok and Lafky 2011). Overall, these results confirm the value of current best practices for de-identification but also indicate that there is merit in complementary legal protection, where possible. The study also highlights a need for better information on disclosure risk, which could be obtained from re-identification attacks on large databases protected with the best current methods.

Another example of re-identification, which received considerable attention in the media, was the re-identification of published Netflix rental histories from the movie reviews submitted by (identified) Netflix customers (see Narayanan and Shmatikov 2008). Although this example does not bear directly on the risks associated with federal data in general or health data in particular, it demonstrates what can be possible with data that are publicly available.

[14] A replication of Sweeney’s calculations with 2000 census data found that the percentage of the population that could be uniquely identified with these same variables had fallen to 63 percent (Golle 2006).
