Minimizing Disclosure Risk in HHS Open Data Initiatives

A. Synthesis


Data collected by the federal government are covered by a substantial array of regulations intended to protect the confidentiality of the data and the privacy of those from whom such data are obtained. While remaining attentive to these requirements, federal agencies have been able and willing to provide researchers across a wide range of disciplines with open access to public use versions of their data. While some users can accept forms of restricted access in return for more extensive content or an ability to link records across datasets, microsimulation modeling and quick turn-around policy analyses are among the many applications that require unrestricted access to their data, which only public use files can provide.

There have been no documented breaches of federal public use data—survey or administrative. And while occasional documented breaches involving non-federal data have received considerable attention, these breaches have been confined almost entirely to data that were not protected in ways consistent with current standards.

Despite this strong track record, views on the adequacy of de-identification or anonymization strategies cover the spectrum. From panelists we heard statements to the effect that, on the one hand, disclosure risks can be managed, and properly de-identified data can be released with little risk; while on the other hand, whatever we do to protect the data will eventually be defeated, and given enough time, effort, incentive, and money, some records may be re-identified. Such statements should not be taken out of context, as the assumptions that underlie them are important. There was general agreement, however, that the disclosure risk cannot be driven to zero without seriously weakening the analytical value of a dataset—an outcome that we will revisit below.

Sources and Evidence of Disclosure Risk. In general, the removal of direct identifiers is not sufficient to de-identify a dataset, as some of the remaining variables can serve as indirect or quasi-identifiers. In combination, they can uniquely identify individuals within the population or well-defined subpopulations. Voter registration files, which typically contain name, address, gender, and date of birth, have figured prominently in a number of the past breaches reported in the literature because a substantial proportion of the U.S. population bears a unique combination of gender, date of birth, and ZIP code.
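The uniqueness of quasi-identifier combinations described above can be checked directly. The sketch below is a minimal illustration using invented toy records (the values and the three quasi-identifier fields are assumptions for the example, not drawn from any real file); it counts how many records carry a unique combination of gender, date of birth, and ZIP code, which is the k-anonymity idea in its simplest form.

```python
from collections import Counter

# Toy records: (gender, birth_date, zip_code) act as quasi-identifiers.
# All values are illustrative, not drawn from any real dataset.
records = [
    ("F", "1970-03-12", "02138"),
    ("M", "1985-07-04", "02138"),
    ("F", "1970-03-12", "02139"),
    ("M", "1985-07-04", "02138"),  # shares its combination with record 2
]

counts = Counter(records)

# A record is unique if no other record shares its quasi-identifier
# combination -- such records are the easiest to re-identify by linking
# against an external file such as voter registration data.
unique = [r for r in records if counts[r] == 1]
k = min(counts.values())  # the file's k-anonymity level

print(len(unique))  # number of records at highest risk
print(k)
```

A file is k-anonymous when every quasi-identifier combination appears at least k times; here two of the four toy records are unique, so the file is only 1-anonymous and would need suppression or coarsening before release.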

In the wake of earlier, well-publicized re-identifications and federal legislation (specifically HIPAA), even non-federal files rarely report the combinations of characteristics that would permit large-scale re-identifications from voter registration records. Nevertheless, only 3 of the 33 states that sell inpatient discharge data are applying HIPAA standards. Recently, Sweeney was able to re-identify a substantial proportion of a small subsample of individuals in hospital discharge data released by the State of Washington, based on information compiled from newspaper accounts. It must be noted, however, that the Washington State data did not comply with widely used NAHDO standards, which were developed to improve confidentiality protections in the high volume of claims data released by states and other organizations. This episode underscores the potential vulnerability of public use data to the kinds of information becoming more widely accessible through Internet searches and web scraping techniques.

While the Washington State data may represent an exception among data distributed by states, many sales or transfers of personal data—particularly health data and financial data—are neither regulated by government nor held to professional standards designed to minimize disclosure risk. Consequently, such data may be especially vulnerable to re-identification, and such breaches can tarnish the movement toward greater transparency.

The Mosaic Effect and Related Threats. With regard to the mosaic effect and its implications, there was general agreement among the TEP participants that the public use files released by HHS and other federal agencies bear limited risk of disclosure, even in combination with other publicly available data. If de-identified properly, the public use data released by federal agencies are not useful to the intruder. On the whole, the confidentiality of federal datasets is threatened less by the release of other federal data than by data from two other sources: (1) datasets compiled by other organizations and released with weak or no de-identification, and (2) personal data revealed through social media. While agencies can control what they release, they cannot control what other organizations and private individuals release. Federal agencies recognize the growing threat from external data and are actively engaged in assessing and responding to the disclosure risks that they pose.

The explosion of personal information posted to the Internet through social media and other avenues means that the availability of data that might be of use to a potential intruder has grown as well. Such information does not cover the entire population, however. Its coverage is far less extensive than voter registration records, for example, and some investigators have noted the impact of incomplete coverage.

If the confidentiality of a dataset has been breached to the extent that many of the records have been correctly re-identified, then all of the variables contained in that dataset for these named individuals become available as potential indirect identifiers that could be used to break into another dataset containing some of the same individuals and some of the same variables. This is an unlikely scenario for a federal database; in actuality, the threat posed by the re-identification of records in a single database covering a narrow subset of the population is likely to be very small, given the limited number of records involved and their minimal overlap with records released by the federal government—particularly from sample surveys.

The release of well-protected federal files does not appear to increase the re-identification risk for other federal files; however, a number of agencies are conducting informed risk assessments. The sheer volume of data files made accessible through the Open Data Initiative is striking. Are there large numbers of individuals who appear repeatedly because of the way that file universes and samples are defined? This may be difficult if not impossible to determine, given restrictions on sharing or linking personal data across agencies, but there is no question that risk assessment would be enhanced with such information.

Protecting Public Use Data. To maximize their effectiveness, statistical disclosure limitation methods must be tailored to the data they are being used to protect. Each dataset faces unique risks, depending on the type of data, the population covered, the sample design, the variables that require the most protection, and the distribution of values on these variables.

Statistical Policy Working Paper 22 has provided valuable documentation of federal agency practice in protecting the confidentiality of federal data. However, the last update of this important resource was nearly 10 years ago, and agencies have upgraded their statistical disclosure limitation methods since then. TEP panelists asserted that regular updates—perhaps as often as every five years—would help to ensure that the document remains current.

A useful way to represent disclosure risk is that the probability of a re-identification is equal to the product of the probability of an attack and the probability of a re-identification conditional on an attack. This implies that disclosure risk can be lowered by strengthening the protections applied to public use data or by taking steps that reduce the likelihood of an attempted re-identification. An important element in the security of data protected with rigorous methods may have been the absence of serious re-identification attacks. That no breaches of federal data have occurred to date may also be due in part to the fact that incentives were not high enough to inspire serious efforts to challenge the disclosure protections, although the protections themselves may have contributed to reducing these incentives.
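The product decomposition above can be made concrete with a small sketch. The two probabilities used here are purely illustrative placeholders, not estimates for any actual federal file; the point is only that the overall risk falls when either factor is reduced.

```python
# Hedged sketch of the risk decomposition:
#   P(re-identification) = P(attack) * P(re-identification | attack)
# Both inputs are illustrative numbers, not empirical estimates.
p_attack = 0.01              # chance someone attempts a re-identification
p_reid_given_attack = 0.05   # chance an attempt succeeds

p_reid = p_attack * p_reid_given_attack
print(p_reid)

# Either factor can be driven down: stronger masking lowers
# p_reid_given_attack, while deterrents such as penalties or data use
# agreements lower p_attack.
```

This framing explains why agencies can pursue two complementary strategies at once: statistical disclosure limitation (the conditional term) and measures that discourage attacks in the first place (the attack term).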

Working in the opposite direction, however, the penalties to which data users are subject for attempting to re-identify records in public use files are light to non-existent. For most agencies the onus falls entirely upon the data producers in the event of a re-identification. This creates a tension for agencies releasing public use files in order to comply with Open Data policies and initiatives.

While de-identification has occupied much of our attention, the confidentiality of public use data is reinforced in other ways. Sampling is an important tool for reducing disclosure risk. This speaks to the security of the federal government’s many public use files of sample survey data. At the same time, however, the added protection afforded by sampling underscores the inherent risks in creating public use files from administrative records, which may not be sampled at all or are sampled at much higher rates than is typically found in surveys.
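The protective effect of sampling noted above is easy to quantify in rough terms: an intruder cannot be sure a target is in the file at all, and the chance that the target's record is present is bounded by the sampling fraction. The fractions below are assumed, round numbers for illustration only.

```python
# Illustrative sketch (assumed numbers): sampling limits an intruder's
# certainty that a target's record is even present in the released file.
f_survey = 0.001   # assumed sampling fraction for a typical sample survey
f_admin = 1.0      # full-coverage administrative file, no sampling

# Upper bound on the probability that a given target's record appears:
print(f_survey)    # about 1 in 1,000 for the survey file
print(f_admin)     # certainty for the unsampled administrative file

# The ratio shows how much more exposed the administrative file is on
# this dimension alone, before any masking is applied.
print(f_admin / f_survey)
```

This is only one component of risk, but it illustrates why public use files built from administrative records warrant stronger masking than sample survey files with otherwise similar content.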

Reporting error, which can be particularly high in sample surveys, also provides protection against disclosure. For some variables in some databases, the application of additional protection in the form of masking may not be needed and may only reduce the quality of the data even further.

More generally, protection versus utility is the key trade-off. The effects of disclosure limitation methods on the quality and general analytic usefulness of public use data are an important concern. Over-zealous confidentiality protection can hamper data quality with no material improvement in data security. Moreover, it is possible to apply such strong protections to public use data that they become useless—and unused. To guard against this outcome, some agencies consult with subject matter experts and their major data users to better assess the trade-offs between analytic utility and effective disclosure limitation.

Secure remote access to restricted (not public use) data is growing as a solution to the problem of maintaining quality while protecting the confidentiality of the most sensitive data, and permitting access to data for applications that require specific indirect or even direct identifiers—for example, for analyses requiring linkage of files. Secure remote access is popular because it provides more convenient access than an RDC. In the end, however, neither of these options will replace public use data, as critical applications of public use data cannot be served by these alternatives, and the rapid expansion of the Open Data Initiative and related efforts only underscores the importance of public use data.

False re-identification is generally not addressed in regulations, yet it may prove a more serious problem than true re-identifications. From a technical standpoint, an agency can protect against a positive (or true) re-identification but not a false one. Furthermore, agencies cannot deny alleged re-identifications except to reiterate that a re-identification is not possible, as an explicit denial may reveal information that assists an intruder with a correct re-identification.

Evaluation of the effectiveness of disclosure limitation methods by attempting to re-identify records internally remains the most powerful approach to establishing that a public use file is secure. A number of agencies obtain external, identified data—both public and commercial—to use in their evaluations. The use of external data in attempts to re-identify records in a preliminary public use file directly addresses the potential threat that such data poses. Such efforts can produce even stronger tests when performed by experienced external consultants, who can provide a fresh approach that is not influenced—or encumbered—by detailed knowledge of how the public use data were created.

Frontiers of Research. Research is providing important enhancements to risk assessment and the ability to assign probabilities to disclosure risk. Disclosure risk and intrusion can be modeled; this has been done in a variety of ways by different investigators. Much of the recent research on protecting microdata has addressed the impact of statistical disclosure limitation on the analytic utility of the data. In particular, research has focused on measuring the information loss due to the application of disclosure limitation measures. More recent research is exploring the problem of maximizing utility while minimizing risk. Lastly, a prominent topic of recent research on statistical disclosure limitation is improving the quality of synthetic data.
