Minimizing Disclosure Risk in HHS Open Data Initiatives. Synthesis


Data collected by the federal government are covered by a substantial array of regulations intended to protect the confidentiality of the data and the privacy of those from whom such data are obtained. While remaining attentive to these requirements, federal agencies have been able and willing to provide researchers across a wide range of disciplines with open access to public use versions of their data. While some users can accept forms of restricted access in return for more extensive content or an ability to link records across datasets, many applications require a level of access that can only be supplied by public use files, for which federal agencies have gone to great lengths to protect respondent confidentiality.

There have been no documented breaches of federal public use data—survey or administrative. And while occasional, documented breaches involving non-federal data have received a lot of attention, these breaches have been confined almost entirely to data that were not protected in ways consistent with current standards.

Despite this strong track record, there was general agreement among the panelists that disclosure risk cannot be driven to zero without seriously weakening the analytical value of a dataset.

Sources and Evidence of Disclosure Risk. In general, the removal of direct identifiers is not sufficient to de-identify a dataset, as some of the remaining variables can serve as indirect or quasi-identifiers. In combination, they can uniquely identify individuals within the population or well-defined subpopulations. Voter registration files, which typically contain name, address, gender, and date of birth, have figured prominently in a number of the past breaches reported in the literature because a substantial proportion of the U.S. population bears a unique combination of gender, date of birth, and ZIP code.
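The role of quasi-identifiers can be illustrated with a minimal sketch that counts how many records in a file are unique on a chosen combination of variables. The records, field names, and the `uniqueness_report` helper below are hypothetical and made up for illustration; they do not come from the report or from any agency's practice.

```python
from collections import Counter

def uniqueness_report(records, quasi_identifiers):
    """Return (share of records unique on the quasi-identifiers, smallest group size)."""
    # Build one key per record from the selected quasi-identifier values.
    keys = [tuple(rec[q] for q in quasi_identifiers) for rec in records]
    counts = Counter(keys)
    # A record is "unique" if no other record shares its key.
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records), min(counts.values())

# Hypothetical toy file: direct identifiers (name, address) already removed.
records = [
    {"gender": "F", "dob": "1980-02-14", "zip": "20001", "dx": "A"},
    {"gender": "F", "dob": "1980-02-14", "zip": "20001", "dx": "B"},
    {"gender": "M", "dob": "1975-07-01", "zip": "20001", "dx": "C"},
    {"gender": "F", "dob": "1991-11-30", "zip": "20002", "dx": "A"},
]

share_unique, smallest_group = uniqueness_report(records, ["gender", "dob", "zip"])
print(share_unique, smallest_group)
```

In this toy file, half of the records remain unique on gender, date of birth, and ZIP code even though no direct identifier is present. Those are the records an intruder holding an identified list with the same three fields could try to match.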

The Health Insurance Portability and Accountability Act (HIPAA) requires that covered entities de-identify the data they share using methods that would protect against a wide range of potential threats, yet only 3 of the 33 states that sell inpatient discharge data are applying HIPAA standards. Recently, Sweeney was able to re-identify a substantial proportion of a small subsample of individuals in hospital discharge data released by the State of Washington, based on information compiled from newspaper accounts. It must be noted, however, that the Washington State data did not comply with widely used standards developed by the National Association of Health Data Organizations. This episode underscores the potential vulnerability of public use data to the kinds of information becoming more widely accessible through Internet searches and web scraping techniques.

While the Washington State data may represent an exception among data distributed by states, many sales or transfers of personal data—particularly health data and financial data—are neither regulated by government nor held to professional standards designed to minimize disclosure risk. Consequently, such data may be especially vulnerable to re-identification.

The Mosaic Effect and Related Threats. There was a consensus among the technical expert panel (TEP) participants that the public use files released by HHS and other federal agencies bear limited risk of disclosure, even in combination with other publicly available data. If de-identified properly, the public use data released by federal agencies are not useful to the intruder. In general, the confidentiality of federal datasets is threatened less by the release of other federal data than by data from two other sources: (1) datasets compiled by other organizations and released with weak or no de-identification, and (2) personal data revealed through social media. While agencies can control what they release, they cannot control what other organizations and private individuals release. Federal agencies recognize the growing threat from external data and are actively engaged in assessing and responding to the disclosure risks that such data pose.

The explosion of personal information posted to the Internet through social media and other avenues means that the availability of data that might be of use to a potential intruder has grown as well. Such information does not cover the entire population, however. Its coverage is far less extensive than voter registration records, for example, and some investigators have noted the impact of incomplete coverage.

If the confidentiality of a dataset has been breached to the extent that many of the records have been correctly re-identified, then all of the variables contained in that dataset for these named individuals become available as potential indirect identifiers that could be used to break into another dataset containing some of the same individuals and some of the same variables. In actuality, the threat posed by the re-identification of records in a single database covering a narrow subset of the population is likely to be very small, given the limited number of records involved and their minimal overlap with records released by the federal government—particularly from sample surveys.

The release of well-protected federal files does not appear to increase the re-identification risk for other federal files; however, a number of agencies are conducting informed risk assessments. The sheer volume of data files made accessible through the Open Data Initiative is striking. Are there large numbers of individuals who appear repeatedly because of the way that file universes and samples are defined? This may be difficult if not impossible to determine, given restrictions on sharing or linking personal data across agencies, but risk assessment would be enhanced with such information.

Protecting Public Use Data. To maximize their effectiveness, statistical disclosure limitation methods must be tailored to the data they are being used to protect. Each dataset faces unique risks, depending on the type of data, the population covered, the sample design, the variables that require the most protection, and the distribution of values on these variables.

Statistical Policy Working Paper 22 has provided valuable documentation of federal agency practice in protecting the confidentiality of federal data. However, the last update of this important resource was nearly 10 years ago, and agencies have upgraded their statistical disclosure limitation methods since then. TEP panelists asserted that regular updates—perhaps as often as every five years—would help to ensure that the document remains current.

A useful way to represent disclosure risk is as a product: the probability of a re-identification equals the probability of an attack multiplied by the probability of a re-identification conditional on an attack. Disclosure risk can therefore be lowered either by strengthening the protections applied to public use data, which reduces the conditional probability, or by taking steps that reduce the likelihood of an attempted re-identification. That no breaches of federal data have occurred to date may be due in part to the fact that incentives were not high enough to inspire serious efforts to challenge the disclosure protections, although the protections themselves may have contributed to reducing these incentives.
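The product form of disclosure risk can be written as a short sketch. The function name and every probability below are illustrative assumptions, not estimates from the report; the point is only that reducing either factor reduces the product.

```python
def reidentification_probability(p_attack, p_reid_given_attack):
    """P(re-identification) = P(attack) * P(re-identification | attack)."""
    for p in (p_attack, p_reid_given_attack):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_attack * p_reid_given_attack

# Illustrative numbers only (assumed, not measured).
baseline = reidentification_probability(0.10, 0.20)

# Lever 1: stronger disclosure limitation lowers the conditional probability.
stronger_protection = reidentification_probability(0.10, 0.05)

# Lever 2: deterrents or weaker incentives lower the probability of an attack.
fewer_attacks = reidentification_probability(0.02, 0.20)

print(baseline, stronger_protection, fewer_attacks)
```

Under these assumed inputs, both levers reduce the overall risk below the baseline, which is why protections and deterrents are complementary rather than alternatives.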

Working in the opposite direction, however, the penalties to which data users are subject for attempting to re-identify records in public use files are light to non-existent. For most agencies the onus falls entirely upon the data producers in the event of a re-identification. This creates a tension for agencies releasing public use files in order to comply with Open Data policies and initiatives.

The confidentiality of public use data is reinforced in ways besides de-identification. Sampling is an important tool for reducing disclosure risk. This speaks to the security of the federal government’s many public use files of sample survey data but underscores the inherent risks in creating public use files from administrative records, which may not be sampled at all or are sampled at much higher rates than is typically found in surveys.
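One way to see why sampling protects a file is a small simulation: a record that looks unique in a sample often shares its quasi-identifier combination with other people in the full population, so a match is not conclusive. The sketch below is an assumption-laden illustration, not a method described in the report: the population is synthetic, each person is reduced to a single hypothetical quasi-identifier "cell", and the sample is drawn by independent (Bernoulli) selection.

```python
import random
from collections import Counter

def share_of_sample_uniques_unique_in_population(population, sampling_fraction, seed=0):
    """Among cells that appear exactly once in the sample, return the share
    that also appear exactly once in the population (None if no sample uniques)."""
    rng = random.Random(seed)
    # Bernoulli sampling: each person is kept independently with the given probability.
    sample = [cell for cell in population if rng.random() < sampling_fraction]
    pop_counts = Counter(population)
    sample_counts = Counter(sample)
    sample_uniques = [c for c, n in sample_counts.items() if n == 1]
    if not sample_uniques:
        return None
    truly_unique = sum(1 for c in sample_uniques if pop_counts[c] == 1)
    return truly_unique / len(sample_uniques)

# Synthetic population: 10,000 people spread at random over 2,000 hypothetical cells.
rng = random.Random(42)
population = [rng.randrange(2000) for _ in range(10000)]

low_rate = share_of_sample_uniques_unique_in_population(population, 0.05)
full_file = share_of_sample_uniques_unique_in_population(population, 1.0)
print(low_rate, full_file)
```

When the whole population is released (a sampling fraction of 1.0, as with an unsampled administrative file), every record that is unique in the file is unique in the population. At a survey-like sampling rate, only a fraction of the apparent uniques are, which is the sense in which sampling adds uncertainty for an intruder.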

Reporting error, which can be particularly high in sample surveys, also provides protection against disclosure. Sometimes the application of additional protection in the form of masking may not be needed and may only reduce the quality of the data.

The effects of disclosure limitation methods on the quality of public use data are an important concern. Over-zealous confidentiality protection can weaken data quality with little improvement in data security. It is possible to apply such strong protections to public use data that they become useless—and unused. To guard against this, some agencies consult with subject matter experts and major data users to better assess the trade-offs between analytic utility and effective disclosure limitation. Secure remote access is growing as a solution to the problem of maintaining quality while protecting confidentiality, but it cannot serve all data needs.

False re-identification is generally not addressed in regulations, yet it may present a more serious problem than true (positive) re-identification. From a technical standpoint, an agency can protect against a true re-identification but not against a false one. Furthermore, agencies cannot deny alleged re-identifications except to reiterate that a re-identification is not possible, as an explicit denial may reveal information that assists an intruder with a correct re-identification.

Evaluation of the effectiveness of disclosure limitation methods by attempting to re-identify records internally remains the most powerful approach to establishing that a public use file is secure. A number of agencies obtain external, identified data—both public and commercial—to use in their evaluations, which may enable more realistic assessments of risk.

Frontiers of Research. Research is providing important enhancements to risk assessment and the ability to assign probabilities to disclosure risk. Disclosure risk and intrusion can be modeled; this has been done in a variety of ways by different investigators. Much of the recent research on protecting microdata has addressed the impact of statistical disclosure limitation on the analytic utility of the data. In particular, research has focused on measuring the information loss due to the application of disclosure limitation measures. More recent research is exploring the problem of maximizing utility while minimizing risk. Lastly, a prominent topic of recent research on statistical disclosure limitation is improving the quality of synthetic data.
