Minimizing Disclosure Risk in HHS Open Data Initiatives. Session 3: What Are the Re-identification Threats to Releasing Federal Data to the Public?


Steve Cohen, AHRQ (moderator)
Daniel Barth-Jones, Columbia University
Khaled El Emam, University of Ottawa and Privacy Analytics
Denise Love, National Association of Health Data Organizations (NAHDO)
Brad Malin, Vanderbilt University
Latanya Sweeney, Federal Trade Commission and Harvard University

Cohen: We need to be forward looking—let’s think about what’s coming during the next 3, 5, or 10 years to address potential threats.

Khaled El Emam:

De-identification has been simplified through automation. In a graphic representation, the process of de-identification in practice involves assessing risk, classifying the variables in the file, and mapping the data. These contribute to specifications in an automated anonymization engine through which the original data are run to produce the anonymized data for release. Our organization has 10 years’ experience in helping clients share health data. We published three books on this topic last year.

Who is an adversary? This can include academia, the media, acquaintances (a neighbor, ex-spouse, employer, relative, or co-worker), the data recipient, and malicious actors. There is no apparent economic case for malicious re-identification of health data. The bigger concern is the media.

There are direct and quasi-identifiers. Examples of direct identifiers include name, address, telephone number, fax number, medical record number, health care number, health plan beneficiary number, voter identification number, license plate number, email address, photograph, biometrics, Social Security number, social insurance number, device number, clinical trial record number. Examples of quasi-identifiers include sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, total years of schooling, marital status, criminal history, total income, visible minority status, profession, event dates, number of children, high-level diagnoses and procedures.

An identifier must satisfy three general criteria. It must be replicable, distinguishable (that is, variable), and knowable. Replicable means that the identifier is sufficiently stable over time and has the same values for the data subject in different data sources. For example, blood glucose level is not replicable, but date of birth is replicable. A potential identifier is distinguishable if there is sufficient variation in the values of the field that it can distinguish among data subjects. A diagnosis field will have low distinguishability in a database of only breast cancer patients but high distinguishability in a claims database. An identifier must be knowable by an adversary.

The likelihood of its being known has to be high. How much an adversary knows will depend on whether the adversary is an acquaintance of the data subject or not. It may also depend on the expected resources of the adversary.

If an adversary is not an acquaintance, the types of information available include: inferences from existing identifiers (for example, determining birth date from the date of the hospital discharge at birth); public data such as voter registration lists (available for free in some states), white pages, and whatever the subject reveals in public forums; semi-public or quasi-public data available for a nominal fee or under terms of use, such as voter registration lists in other states; and non-public sources, such as commercial databases or provider databases, that are costly to acquire.

How do we protect confidentiality when there are multiple quasi-identifiers? How much will an adversary know? If there are 10 quasi-identifiers, what assumptions do we make about the knowledge of the adversary? These are assumptions about adversary power. Assume, for example, that an adversary will know only 5 of the 10 quasi-identifiers. We can consider all combinations of 5 things and manage the risk for every combination. This becomes a solvable computational problem.
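
As an illustration of how this becomes computationally tractable, the sketch below enumerates every 5-column combination of 10 quasi-identifiers in a pandas data frame and reports the combinations whose smallest equivalence class falls below a chosen size. This is illustrative only, not the Privacy Analytics engine; the column names and the k threshold are assumptions made for the example.

```python
# Illustrative only: enumerate all C(10, 5) = 252 combinations of
# quasi-identifiers an adversary might know and measure the smallest
# group size (the "k" of k-anonymity) for each combination.
from itertools import combinations
import pandas as pd

QUASI_IDENTIFIERS = [  # hypothetical column names
    "sex", "birth_year", "zip3", "language", "ethnicity",
    "education_years", "marital_status", "income_band",
    "profession", "num_children",
]

def risky_combinations(df: pd.DataFrame, known: int = 5, k_threshold: int = 5):
    """Return (combination, smallest group size) pairs below the k threshold."""
    risky = []
    for combo in combinations(QUASI_IDENTIFIERS, known):
        smallest_group = int(df.value_counts(list(combo)).min())
        if smallest_group < k_threshold:
            risky.append((combo, smallest_group))
    return sorted(risky, key=lambda pair: pair[1])
```

With 10 quasi-identifiers and an adversary assumed to know any 5 of them, there are only 252 combinations to check, so managing the risk for every combination is feasible.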

Some special types of data require specialized techniques. There are good techniques to de-identify geo-spatial information (including movement trajectories), dates and long sequences of dates (for example, transactional data), and streaming data—that is, data that is continuously being updated.

What is the impact of the mosaic effect? If de-identified properly, open data is not particularly useful for further attacks because it has no identifiable information, and the success rate of linking these data to other data should be small. Will the risks increase over time? Probably, but the same is true of encryption algorithms, and we still encrypt data. Should we release data publicly? Decent data can be created for public release, and we can add terms of use or conditions in order to release higher quality data.

I also wonder about the cost-effectiveness of research data centers (RDCs). They are not used very much. I recently conducted research at a Statistics Canada RDC, and the only other person I ever saw was the center librarian.


Brad Malin:

We have constructed a de-identification system for DNA sequence data. Our database applies de-identification techniques to data on 2 million patients in the Vanderbilt system. The data is being used by 200 researchers, and we have biospecimens for 200,000 patients. Researchers may use the data, subject to a data use agreement (DUA) with the National Institutes of Health (NIH).

We published a paper two weeks ago on a probabilistic model for patient disclosure based on estimating population uniqueness across datasets (Sattar et al. 2014 in Knowledge-Based Systems). One needs to be cognizant of data over time. If you anonymize someone in different ways at different points in time, this may actually make that person easier to identify.

Research has shown the variety of characteristics and behaviors that can distinguish an individual. These include demographics, diagnosis codes, lab tests, DNA, health survey responses, location visits, pedigree structure, movie reviews, social network structure, search queries, internet browsing, and smart utility meter usage.

A colleague and I showed that the potential number of individuals who could be identified with demographic data from voter registration lists and the cost per person identified varied dramatically across a subset of states. The risk was substantially greater for a HIPAA Limited Dataset than a dataset protected with HIPAA Safe Harbor methods.

We are working on research to understand the incentives behind re-identification. A simplified view of risk is that the probability of re-identification is approximately equal to the product of the probability of an attack and the probability of a re-identification conditional on an attack. Our incentive system is broken. Incentives exist for researchers to re-identify and publish the results. This, in turn, may allow private industry to learn re-identification techniques from published literature. Deterrents to attack include DUAs, access gateways, unique login IDs and passwords, and audits. Data characteristics that affect the conditional probability of a re-identification include uniqueness, replicability, availability, and cost.
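
Written out, the simplified decomposition described above is:

```latex
\Pr(\text{re-identification}) \approx \Pr(\text{attack}) \times \Pr(\text{re-identification} \mid \text{attack})
```

The deterrents listed lower the first factor, while the data characteristics listed affect the second.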

Latanya Sweeney:

This conversation is not much different than it was in 1997, but the world has changed a lot since then. Technology is constantly clashing with society—it’s a matter of trust.

Under my direction, the Data Privacy Lab at Harvard University initiated the DataMap project to document where personal health data goes outside of the doctor-patient relationship. Maps show the flow of data from the patient to various entities and from the physician and hospital back to the patient. These data maps also show the flow of the patient’s personal health data from the immediate recipients to numerous other entities. The maps indicate that much of the data is transmitted with personal identifiers although some is not. Flows that do not directly involve the patient are numerous.

Less than half of the documented data flows are covered by HIPAA. Among the flows HIPAA does not cover is inpatient discharge data transmitted without explicit identifiers. Almost all states collect inpatient discharge data, and 33 states sell or share de-identified versions of their discharge data. According to a 2013 survey we conducted, only 3 of the 33 states use HIPAA standards to protect these data.

Recently, I purchased a public use version of patient-level hospital discharge data from Washington State. Using accounts of accidents published in newspapers in 2011, I was able to re-identify 43 percent of a sample of 81 accident victims in the hospital discharge data based on characteristics reported in both sources. The kinds of information reported in news stories are often known by others, including family, friends, neighbors, employers, and creditors.
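
A minimal sketch of the linkage logic behind this kind of study is below, assuming a pandas data frame for each source; the field names are hypothetical, and this is not Sweeney's actual procedure.

```python
# Join news-derived attributes to hospital discharge records on shared
# fields and keep only news items that match exactly one discharge row:
# a single match is a candidate re-identification, multiple matches are
# ambiguous. All column names here are hypothetical.
import pandas as pd

MATCH_KEYS = ["hospital", "admit_month", "age", "sex", "zip5"]

def candidate_reidentifications(news: pd.DataFrame,
                                discharges: pd.DataFrame) -> pd.DataFrame:
    merged = news.merge(discharges, on=MATCH_KEYS, how="inner")
    match_counts = merged.groupby("news_id")["news_id"].transform("size")
    return merged[match_counts == 1]
```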

Data brokers make data available very cheaply compared to the states.

We filed a FOIA request to determine who the buyers of publicly available health data are. Predictive analytics companies are the big buyers. They are producing data products that exploit publicly available health data.

There are four ways to add transparency to the system. First, public notice of privacy breaches should be required. Second, data holders should be required to list publicly those with whom they share data. Third, each person should be able to acquire copies of their personal data from any entity holding their data. Fourth, each person should also be able to acquire an audit trail of the organizations with which their data was shared.

Re-identification is a key part of the cycle of improving the protection of data. We improve protective techniques only after protections fail. Encryption techniques have improved because they were used, problems were identified, and better techniques were developed. We now have strong encryption. We need the prevention of re-identification to get there as well.

A written summary is available at 04/transparency-establishes-trust.

Denise Love:

We have been involved for years in discussions regarding these issues with states. We are proud of the solutions that states have come up with to balance transparency and confidentiality. These data systems are essential to public health and multiple other purposes. Documented uses of state health care databases include:

  • Public safety, injury surveillance, and prevention
  • Disease surveillance, public health registries
  • Health planning, such as community needs assessments, hospital conversions and closures
  • Market share analyses, hospital strategic planning
  • Quality assessments and improvement, patient safety, outcomes studies
  • Public reporting, informed purchasing (outcomes and charges)
  • Transparency
  • Health systems performance
  • Identification of overuse, underuse, and misuse of health care services

There is a critical “iron triangle” for public data, representing three principles of data policy: transparency (public availability and information about the data), data utility, and data safety. There must be a balance among all three. Over-emphasis on any one of the three does not serve the public good.

Public data resources are valuable assets. A one-size-fits-all approach to data governance is not feasible. Each data system has a set of stakeholders, laws, and history. Useful data can be shared while controlling for risks using statistical methods, management controls, and web query systems. We hear more complaints about the lack of data sharing than about the risks.

DUAs can mitigate the risk of inappropriate use. The Washington State story is the first breach that we’ve ever heard about. NAHDO spent a year developing guidelines for data release by states, which were published in January 2012, but Washington State was not following these guidelines.

Daniel Barth-Jones:

My recent work uses uncertainty analysis built around a flow chart that lays out several components, including intrusion scenarios and the variables an intruder would need for re-identification. I add an uncertainty distribution at each step of the flow chart to give a sense of how data protection and disclosure avoidance techniques can reduce re-identification risk. The intrusion scenarios include a “nosy neighbor” attack; a mass-marketing attack that tries to re-identify as many individuals as possible for marketing purposes; and a demonstration attack by an academic researcher or a journalist who tries to identify particular or random people just to show vulnerability, or to seek attention in order to influence public policy.

The flow charts I have developed include the different data elements (the variables that pose a re-identification risk) needed by the intruder under each intrusion scenario. There could be as many as 3,000 potential variables. However, because the data are often not fully accurate and the intruder usually cannot build a complete population register, there are often false positives. Each step in the flow chart has a probability distribution; you can then sample across the scenario with a hyper-grid multiple times, which gives a robust picture of the re-identification risk. The chart may also include a model of the trade-off between the cost of protecting the data and the volume of disclosure. There are dependencies at each step in the chain that determine the economic motivation or benefit to the entity. Playing up single re-identifications may convey the wrong message to policymakers.
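
A minimal Monte Carlo sketch of this style of uncertainty analysis is below. The step names and beta distributions are assumptions made for illustration, and it uses simple random sampling rather than the hyper-grid sampling described above.

```python
# Propagate uncertainty through the steps of one intrusion scenario by
# drawing each step's probability from an assumed distribution and
# multiplying along the chain; the distributions are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # simulation draws

p_attempt  = rng.beta(2, 20, N)   # an intruder attempts this scenario
p_has_vars = rng.beta(5, 5, N)    # intruder holds the needed quasi-identifiers
p_unique   = rng.beta(2, 30, N)   # the target is unique in the population
p_correct  = rng.beta(8, 2, N)    # the match is not a false positive

risk = p_attempt * p_has_vars * p_unique * p_correct

print(f"median re-identification risk: {np.median(risk):.5f}")
print(f"95th percentile:               {np.percentile(risk, 95):.5f}")
```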

It is important to consider the impact of de-identification on statistical analysis. Poorly implemented de-identification can distort multivariate relationships and hide heterogeneities. This can be illustrated using plots of data from census public use microdata samples, where each dot is a combination of three quasi-identifiers (age, income, and education in years) and each color represents a different race. Data reduction through sampling and other means can destroy the ability to identify heterogeneity among the races.

Starting with a two percent sample, 3.5 percent of records are population unique, 40.6 percent are sample unique but not population unique, and 56 percent are not unique. When education in years is replaced with six education categories, the population uniques are reduced to zero, the sample uniques that are not population unique are reduced to 8 percent, and the share that is not unique increases to 92 percent. If education is dichotomized as more than high school graduation versus high school graduation or less, the sample uniques that are not population unique are further reduced to just 0.6 percent, while those that are not unique increase to 99.4 percent. If education is removed, no records are unique.
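
The uniqueness calculation behind those percentages can be sketched as below, using hypothetical column names. Only sample uniqueness is computed directly; population uniqueness requires the full population file or an estimator.

```python
# Share of records that are unique on a set of quasi-identifiers, before
# and after coarsening education from years into six assumed categories.
import pandas as pd

def pct_sample_unique(df: pd.DataFrame, quasi_ids: list) -> float:
    """Percentage of records unique on the given quasi-identifier columns."""
    counts = df.value_counts(quasi_ids)
    return 100 * (counts == 1).sum() / len(df)

def coarsen_education(df: pd.DataFrame) -> pd.DataFrame:
    """Replace years of education with six broad categories (cut points assumed)."""
    out = df.copy()
    out["education_cat"] = pd.cut(out["education_years"],
                                  bins=[0, 8, 11, 12, 15, 16, 30],
                                  labels=False, include_lowest=True)
    return out

# pct_sample_unique(sample, ["age", "income", "education_years"])
# pct_sample_unique(coarsen_education(sample), ["age", "income", "education_cat"])
```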

A forthcoming paper by T.S. Gal et al. evaluates the impact of four different anonymization methods on the results obtained from three different types of regression models estimated with colon cancer and lung cancer data. For each combination the authors calculated the percentage of coefficients that changed significance between the original data and the anonymized data. Depending on the de-identification technique that was used, the health dataset, and the type of regression model that was evaluated, these percentages varied but were mostly non-trivial, ranging between 40 and 80 percent for the one de-identification technique that fared worst. The best method, one proposed by the authors, was consistently below 20 percent.
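
The evaluation metric described above can be sketched as follows, shown here with a logistic regression standing in for whichever model types the paper evaluates; the formula, outcome, and column names are assumptions.

```python
# Share of coefficients whose significance at alpha = 0.05 differs between
# a model fit on the original data and the same model fit on the anonymized
# data. The logistic specification and variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

def pct_significance_changed(original: pd.DataFrame,
                             anonymized: pd.DataFrame,
                             formula: str,
                             alpha: float = 0.05) -> float:
    fit_orig = smf.logit(formula, data=original).fit(disp=False)
    fit_anon = smf.logit(formula, data=anonymized).fit(disp=False)
    sig_orig = fit_orig.pvalues < alpha
    sig_anon = fit_anon.pvalues.reindex(sig_orig.index) < alpha
    return 100 * (sig_orig != sig_anon).mean()

# pct_significance_changed(df_original, df_deidentified,
#                          "five_year_survival ~ age + stage + comorbidity")
```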

HIPAA imposes no penalty if data are re-identified by the user, even when the matches are false positives. Robert Gellman has proposed a personal data de-identification act. Currently, there is no cost for a false positive identification. We need to change the cost of false positive identification in order to change the economic incentives for re-identification efforts.


Cohen: I identified the following themes during these presentations: Brad discussed game theory; Latanya noted data sources that could result in a breach and observed that even one breach is too many; and Daniel discussed how to simulate the threat. Where are we heading in the next five years to address these threats?

Malin: Social media are a serious threat. People self-disclose and disclose about others. We are doing research on how Twitter is used (it turns out that people talk more about others than about themselves), and it is a minefield of potential disclosures (“pray for my mom; she has breast cancer”). Another challenge is that electronic health records are becoming commercialized, and start-ups are using data without regulation—this is a big loophole.

Sweeney: No one is really studying the predictive analytics industry, so we don’t know how big an industry it is. Re-identification is a way of illustrating risk—it’s big although unquantified. We don’t know how much really goes on. DUAs don’t stop it; they just hide it, because the penalties are so draconian. The focus of our conversation should not be on re-identification, but rather on disclosure risk. The future of releasing data to the public should not be from large private-sector organizations such as Google or Facebook. Instead, federal agencies should try to figure out how to link data in a secure way in the cloud to produce aggregated data for the public.

Barth-Jones: The future concern is harm from bad de-identification practice—from bad science and inefficiency. We should focus on reducing bad de-identification practices.

Love: Washington State withdrew from a best-practices process. In the future, discharge and all-payer data will be derived from claims, but then they become a statistical abstract rather than identifiable data. I’m worried that data will be too protected, and opt-in/opt-out will be disastrous for public health and for population health (for example, when parents do not vaccinate their children).

El Emam: Techniques are becoming more sophisticated, including protection of data. Adoption of good de-identification practices has been less than ideal. Risks can be managed with appropriate practices.

Malin: We have been having a four-year dialogue with the community regarding use of data for research purposes. A catastrophic breach will shut down research efforts. You need to involve the community in these discussions—create an advisory board and keep them in place, and make them partners. This reduces the risk of research being shut down in the event of a breach.
