Minimizing Disclosure Risk in HHS Open Data Initiatives. 1. Panel Presentations


Khaled El Emam noted that de-identification has been simplified through automation. The process of de-identification in practice involves assessing risk, classifying the variables in the file, and mapping the data. These contribute to specifications in an automated anonymization engine through which the original data are run to produce the anonymized data for release.

Adversaries (that is, those who might re-identify the data) may include academia, the media, a person’s acquaintances, the data recipient, and malicious actors. Interestingly, there is no apparent economic case for malicious re-identification of health data; the bigger concern is the media.

There are direct identifiers and quasi-identifiers. Examples of direct identifiers include name, address, telephone number, fax number, medical record number, health care number, health plan beneficiary number, voter identification number, license plate number, email address, photograph, biometrics, Social Security number, device number, and clinical trial record number. Examples of quasi-identifiers include sex, date of birth or age, geographic locations (such as postal codes, census geography, and information about proximity to known or unique landmarks), language spoken at home, ethnic origin, total years of schooling, marital status, criminal history, total income, visible minority status, profession, event dates, number of children, and high-level diagnoses and procedures.

An identifier must satisfy three general criteria: it must be replicable, distinguishable, and knowable. Replicable means that the identifier is sufficiently stable over time and has the same values for the data subject in different data sources (for example, blood glucose level is not replicable, but date of birth is replicable). A potential identifier is distinguishable if there is sufficient variation in the values of the field that it can distinguish among data subjects (for example, a diagnosis field will have low distinguishability in a database of only breast cancer patients but high distinguishability in a claims database). An identifier must be knowable by an adversary, and how much an adversary knows will depend on whether the adversary is an acquaintance of the data subject or not. If an adversary is not an acquaintance, the types of information that are available include inferences from existing identifiers (such as date of hospital discharge at birth) and public data (such as voter registration lists).

Determining risk is a solvable computational problem. Make assumptions about the knowledge of the adversary and how many quasi-identifiers it has, consider all combinations of these, and then manage the risk for every combination.
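The enumeration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used by any panelist: the records, the field names (sex, birth_year, zip3), and the choice of "one divided by the smallest group size" as the risk measure are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def max_risk_by_combination(records, quasi_identifiers, max_known=None):
    """Worst-case re-identification risk for each combination of quasi-identifiers.

    Assumption: the risk for a record is 1 / (number of records sharing its
    values on the chosen fields), so the worst case for a combination is
    1 / (smallest group size).
    """
    if max_known is None:
        max_known = len(quasi_identifiers)  # adversary may know every field
    risks = {}
    for k in range(1, max_known + 1):
        for combo in combinations(quasi_identifiers, k):
            group_sizes = Counter(tuple(r[q] for q in combo) for r in records)
            risks[combo] = 1 / min(group_sizes.values())
    return risks


def combinations_over_threshold(risks, threshold):
    """Combinations whose worst-case risk exceeds the chosen threshold."""
    return sorted(combo for combo, risk in risks.items() if risk > threshold)


# Hypothetical toy data, for illustration only.
records = [
    {"sex": "F", "birth_year": 1980, "zip3": "021"},
    {"sex": "F", "birth_year": 1980, "zip3": "021"},
    {"sex": "M", "birth_year": 1980, "zip3": "021"},
    {"sex": "M", "birth_year": 1975, "zip3": "021"},
    {"sex": "M", "birth_year": 1975, "zip3": "100"},
    {"sex": "F", "birth_year": 1975, "zip3": "100"},
]
fields = ["sex", "birth_year", "zip3"]

risks = max_risk_by_combination(records, fields)
for combo in combinations_over_threshold(risks, threshold=0.5):
    print(combo, risks[combo])
```

Any combination flagged here would then need to be managed, for example by generalizing or suppressing values, before release. The number of combinations grows quickly with the number of quasi-identifiers, so the hypothetical max_known parameter caps how many fields the adversary is assumed to know.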

Some special types of data require specialized techniques. There are good techniques to de-identify geo-spatial information (including movement trajectories), dates and long sequences of dates (for example, transactional data), and streaming data—that is, data that is continuously being updated.

If de-identified properly, open data is not particularly useful for further attacks because it has no identifiable information, and the success rate of linking these data to other data should be small. Decent data can be created for public release, and we can add terms of use or conditions in order to release higher quality data.

Brad Malin described the de-identification system for DNA sequence data that his team constructed. The database contains 2 million patients and biospecimens for 200,000 patients, and the data is being used by 200 researchers (subject to a DUA with the National Institutes of Health).

His team published a paper in June 2014 on a probabilistic model for patient disclosure based on estimating population uniqueness across datasets (Sattar et al. 2014). One needs to be cognizant of data over time: if a data holder anonymizes someone in different ways at different points in time, this may actually make that person easier to identify.

Research has shown the variety of characteristics and behaviors that can distinguish an individual. These characteristics and behaviors include demographics, diagnosis codes, lab tests, DNA, health survey responses, location visits, pedigree structure, movie reviews, social network structure, search queries, Internet browsing, and smart utility meter usage. A study he conducted found that re-identification risk was substantially greater for a HIPAA limited dataset than for a dataset protected with HIPAA Safe Harbor methods.

A simplified view of risk is that the probability of re-identification is approximately equal to the product of the probability of an attack and the probability of a re-identification conditional on an attack. Deterrents to attack include DUAs, access gateways, unique login IDs and passwords, and audits. Data characteristics that affect the conditional probability of a re-identification include uniqueness, replicability, availability, and cost.
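As a small worked illustration of this product, the Python sketch below multiplies the two probabilities and shows how a deterrent that lowers the probability of an attack lowers the overall risk. The function name and all of the numbers are hypothetical, chosen only to make the arithmetic visible.

```python
def reidentification_probability(p_attack, p_reid_given_attack):
    """Simplified risk: Pr(re-id) is approximately Pr(attack) * Pr(re-id | attack)."""
    for name, p in (("p_attack", p_attack),
                    ("p_reid_given_attack", p_reid_given_attack)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {p}")
    return p_attack * p_reid_given_attack


# Hypothetical values: the same data released with and without deterrents
# such as a DUA and audits, which are assumed to reduce Pr(attack).
without_deterrents = reidentification_probability(0.25, 0.5)
with_deterrents = reidentification_probability(0.05, 0.5)
print(without_deterrents, with_deterrents)
```

In this view, controls on access act on the first factor, while changes to the data itself (reducing uniqueness, for instance) act on the second.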

Latanya Sweeney began her remarks by noting that this conversation is not much different than it was in 1997, but the world has changed a lot since then. The Data Privacy Lab at Harvard University initiated the DataMap project to document where personal health data goes outside of the doctor-patient relationship. Maps show the flow of data from the patient to various entities and from the physician and hospital back to the patient. Flows that do not directly involve the patient are numerous, and fewer than half of the documented data flows are covered by HIPAA, including inpatient discharge data transmitted without explicit identifiers.

A study she led found that only three of the 33 states that sell or share de-identified versions of their hospital inpatient discharge data are using HIPAA standards to protect the data. In a separate study, her team purchased a public use version of patient-level hospital discharge data from Washington State, and using accounts of accidents published in newspapers in 2011, they were able to re-identify 43 percent of a sample of 81 accident victims in the hospital discharge data based on characteristics reported in both sources.

With colleagues she submitted a FOIA request to determine who are the buyers of publicly available health data, and found that predictive analytic companies are the big buyers. They are producing data products that exploit publicly available health data.

There are four ways to add transparency to the system: (1) public notice of privacy breaches should be required; (2) data holders should be required to list publicly those with whom they share data; (3) each person should be able to acquire copies of their personal data from any entity holding their data; and (4) each person should also be able to acquire an audit trail of the organizations with which the data was shared.

Re-identification is a key part of the cycle of improving the protection of data. We improve protective techniques only after protections fail. For example, encryption techniques have improved because they were used, problems were identified, and better techniques were developed. We now have strong encryption, and we need the prevention of re-identification to advance to that stage as well.

Denise Love explained that NAHDO has been involved for years in discussions regarding these issues with states. The state data agencies have come up with solutions to balance transparency and confidentiality.

The state inpatient discharge and all-payer claims data systems are essential to public health and multiple other purposes, including public safety, injury and disease surveillance, health planning, market share analyses, quality assessments and improvement, and identification of overuse/underuse/misuse of health care services.

There is a critical “iron triangle” to public data, representing three principles of data policy: transparency, data utility, and data safety. There must be a balance among all three. Over-emphasis on any one of the three does not serve the public good.

DUAs can mitigate the risk of inappropriate use. The Washington State story is the first breach that we’ve ever heard about. NAHDO spent a year developing guidelines for data release by states, which were published in January 2012, but Washington State was not following these guidelines.

Daniel Barth-Jones discussed his recent work using uncertainty analysis through a flow chart that lays out several components including intrusion scenarios and information on what variables are needed by an intruder for re-identification. Adding an uncertainty distribution at each step of the flowchart gives a sense of how the data protection and disclosure avoidance techniques can reduce re-identification risk.

Intrusion scenarios include a “nosy neighbor” attack, a mass marketing-type attack to re-identify as many individuals as possible for marketing purposes, and a demonstration attack by a researcher in academia or a journalist. There could be as many as 3,000 potential variables/data elements. However, because the data are often inaccurate and the intruder cannot build a complete population register, there are often false positives. Each step in the flow chart has a probability distribution; sampling across the scenario with a hyper-grid multiple times gives a robust idea of the re-identification risk. There are dependencies at each step in the chain to determine the economic motivation or benefit to the entity.
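The idea of attaching an uncertainty distribution to each step and sampling repeatedly can be illustrated with a short Python sketch. It is not the speaker's model: it substitutes plain Monte Carlo sampling for the hyper-grid sampling mentioned above, treats the steps as independent for simplicity (the text notes that real scenarios have dependencies), and uses invented step names and triangular distributions with made-up bounds.

```python
import random
import statistics

# Hypothetical steps in one intrusion scenario. Each entry gives the
# (low, most likely, high) probability that the step succeeds.
STEPS = {
    "attack_attempted": (0.01, 0.05, 0.20),
    "intruder_has_needed_variables": (0.10, 0.40, 0.80),
    "record_is_unique_in_data": (0.05, 0.20, 0.50),
    "match_is_correct": (0.30, 0.60, 0.90),
}


def simulate_risk(steps, n_draws=10_000, seed=0):
    """Draw each step's probability from its distribution and multiply them."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    draws = []
    for _ in range(n_draws):
        risk = 1.0
        for low, mode, high in steps.values():
            risk *= rng.triangular(low, high, mode)
        draws.append(risk)
    return draws


def summarize(draws):
    """Mean, median, and approximate 95th percentile of the simulated risk."""
    ordered = sorted(draws)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[int(0.95 * (len(ordered) - 1))],
    }


print(summarize(simulate_risk(STEPS)))
```

The output is a distribution of plausible risk values rather than a single number, which is the point of the uncertainty analysis: it shows how much a given protection shifts the whole range, not just a best guess.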

It is important to consider the impact of de-identification on statistical analysis. Poorly implemented de-identification can distort multivariate relationships and hide heterogeneities. Data reduction through sampling and other means can destroy the ability to identify heterogeneity among the races, or by educational level, for example.

A forthcoming paper by T.S. Gal et al. evaluates the impact of four different anonymization methods on the results obtained from three different types of regression models estimated with colon cancer and lung cancer data. For each combination the authors calculated the percentage of coefficients that changed significance between the original data and the anonymized data.
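The metric described here, the percentage of coefficients whose significance changes, can be sketched as follows. This is an assumption-laden illustration rather than the authors' procedure: the 0.05 significance level, the function name, and the p-values are all hypothetical.

```python
def pct_significance_changed(p_original, p_anonymized, alpha=0.05):
    """Percentage of coefficients significant in one fit but not in the other.

    Both arguments map coefficient names to p-values from the same model,
    estimated on the original data and on the anonymized data respectively.
    """
    if p_original.keys() != p_anonymized.keys():
        raise ValueError("both fits must report the same coefficients")
    if not p_original:
        raise ValueError("no coefficients to compare")
    changed = sum(
        (p_original[name] < alpha) != (p_anonymized[name] < alpha)
        for name in p_original
    )
    return 100.0 * changed / len(p_original)


# Hypothetical p-values for one regression model, for illustration only.
p_original = {"age": 0.001, "stage": 0.030, "sex": 0.200, "smoker": 0.049}
p_anonymized = {"age": 0.002, "stage": 0.080, "sex": 0.300, "smoker": 0.040}

print(pct_significance_changed(p_original, p_anonymized))  # only "stage" flips
```

Repeating this calculation for each combination of anonymization method and model type would give a grid of percentages comparable in form to the evaluation described above.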

HIPAA lacks a penalty if data is re-identified by the user, even if these are false positives; currently there is no cost for false positive identification. We need to change the cost for false positive identification to change the economic incentives for efforts at re-identification.
