Minimizing Disclosure Risk in HHS Open Data Initiatives. A. Re-identification of Individuals in Data Released to the Public


There are exceedingly few documented instances of the re-identification of individual persons in datasets that have been released to the public. None has involved a sample survey or a federal government database, and few have involved data that were protected by methods that would be considered rigorous by today’s standards.

The most famous case, which predated the HIPAA Privacy Rule and influenced its development, was Latanya Sweeney’s 1996 re-identification of a substantial fraction of the records in a database of Massachusetts state employees discharged from hospitals (Cavoukian and Castro 2014). The records had been de-identified by removing names, Social Security numbers, health insurance IDs, hospital names, doctors’ names, and other obvious identifiers, but ZIP code, sex, and date of birth had been retained because of their analytic value. Sweeney matched the values of these three variables against a city voter registration list that had been purchased for a nominal fee. The employees who were re-identified included the state’s governor. With this work, Sweeney (1997) demonstrated that removing explicit identifiers does not guarantee that records are anonymous, that is, unable to be associated with individual persons.
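The linkage described above can be illustrated with a minimal sketch. It is not Sweeney’s actual procedure; it assumes two small, entirely fabricated files (a de-identified release and an identified voter list) and hypothetical field names (`zip`, `sex`, `dob`, `diagnosis`, `name`), and it accepts a link only when the combination of the three variables is unique in both files.

```python
from collections import Counter, defaultdict

# Fabricated records for illustration only; every value is hypothetical.
QUASI_IDENTIFIERS = ("zip", "sex", "dob")

released = [
    {"zip": "00001", "sex": "M", "dob": "1950-01-15", "diagnosis": "A"},
    {"zip": "00002", "sex": "F", "dob": "1962-03-02", "diagnosis": "B"},
    {"zip": "00002", "sex": "F", "dob": "1971-11-30", "diagnosis": "C"},
]
voter_list = [
    {"name": "Person One", "zip": "00001", "sex": "M", "dob": "1950-01-15"},
    {"name": "Person Two", "zip": "00002", "sex": "F", "dob": "1962-03-02"},
    {"name": "Person Three", "zip": "00002", "sex": "F", "dob": "1962-03-02"},
]


def quasi_key(record):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link_unique(released_records, identified_records):
    """Pair a released record with a name only when the match is one-to-one."""
    by_key = defaultdict(list)
    for person in identified_records:
        by_key[quasi_key(person)].append(person)
    released_counts = Counter(quasi_key(r) for r in released_records)

    matches = []
    for record in released_records:
        key = quasi_key(record)
        candidates = by_key.get(key, [])
        # Accept a link only if the key is unique in both files.
        if len(candidates) == 1 and released_counts[key] == 1:
            matches.append((candidates[0]["name"], record["diagnosis"]))
    return matches


matches = link_unique(released, voter_list)
print(matches)
```

In this toy data only the first released record is linked to a name. The second matches two voters and is therefore ambiguous, and the third has no counterpart in the voter list, which illustrates why only a fraction of records is typically re-identifiable by this kind of matching.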

Underscoring this point, Sweeney et al. (no date) used voter registration and on-line public records to re-identify a subset of personal records in a public database of medical and genomic information. Participants in the Personal Genome Project could choose to make their data public, and while the profiles were not explicitly identified, the authors report that the consent forms did not guarantee privacy. Of the 1,130 profiles that had been made public as of September 1, 2011, 579 (51 percent) included the participant’s full date of birth, gender, and five-digit ZIP code. Using a sample of voter registration records acquired from a commercial source for the ZIP codes represented among the 579 profiles, the authors were able to match 130 of the profiles uniquely to individual voter records and obtain names. The authors were also able to match 156 of the profiles to the on-line public records, although nearly half of these duplicated matches already made to the voter records. Genome project staff confirmed that 93 percent of the names obtained from the voter records and 87 percent of the names obtained from the on-line public records agreed with the names recorded in the profiles, and allowance for nicknames would have raised these rates even higher. These results provide further evidence of the uniqueness of many combinations of ZIP code, date of birth, and gender. Indeed, Sweeney (2000) estimated that 87 percent of the U.S. population could be uniquely identified by the combination of five-digit ZIP code, date of birth, and gender (see footnote 3).
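The notion of uniqueness behind these estimates can be made concrete with a short sketch. It is not the method used in the cited studies; it assumes a handful of fabricated profiles and hypothetical field names, and it simply measures the share of records whose combination of ZIP code, date of birth, and gender occurs exactly once, before and after coarsening date of birth to year of birth.

```python
from collections import Counter


def unique_share(records, fields):
    """Fraction of records whose combination of `fields` occurs exactly once."""
    if not records:
        return 0.0
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    singletons = sum(1 for key in keys if counts[key] == 1)
    return singletons / len(records)


# Fabricated profiles for illustration only; every value is hypothetical.
profiles = [
    {"zip": "00001", "dob": "1980-05-17", "gender": "F"},
    {"zip": "00001", "dob": "1980-05-17", "gender": "F"},
    {"zip": "00001", "dob": "1980-09-03", "gender": "F"},
    {"zip": "00002", "dob": "1975-01-01", "gender": "M"},
]

# Share of profiles that are unique on the full combination.
full = unique_share(profiles, ("zip", "dob", "gender"))

# Coarsen date of birth to year of birth and measure again.
coarsened = [{**p, "birth_year": p["dob"][:4]} for p in profiles]
coarse = unique_share(coarsened, ("zip", "birth_year", "gender"))

print(full, coarse)
```

In this toy data, half of the profiles are unique on the full combination, and the share falls to one quarter once date of birth is reduced to year of birth. Real populations behave differently in magnitude, but the direction of the effect is the reason coarsening such variables lowers disclosure risk.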

El Emam et al. (2011) conducted a systematic review of known re-identification attacks on health data and other types of data. The review uncovered 14 re-identification attacks in which at least one individual was accurately re-identified. Of the 14 examples, 11 were conducted by researchers solely to demonstrate or evaluate the risk of re-identification. Notably, only 2 of the 14 involved databases that were protected in accordance with current standards. One of the two was a health database, consisting of records from a regional hospital that were protected under the Safe Harbor provisions of the HIPAA Privacy Rule, and the rate of re-identification was found to be very low (just 0.022 percent, representing two persons) despite strong assumptions about what an intruder might know (see Kwok and Lafky 2011). Overall, these results confirm the value of current best practices for de-identification but also indicate that there is merit in complementary legal protection, where possible. The study also highlights a need for better information on disclosure risk, which could be obtained from re-identification attacks on large databases protected with the best current methods.

Another example, which received considerable attention in the media, was the re-identification of published Netflix rental histories from the movie reviews submitted by (identified) Netflix customers (see Narayanan and Shmatikov 2008). Although this example does not bear directly on the risks associated with federal data in general or health data in particular, it demonstrates what can be possible with data that are publicly available.

3 A replication of Sweeney’s calculations with 2000 census data found that the percentage of the population that could be uniquely identified with these same variables had fallen to 63 percent (Golle 2006).
