Recent open data initiatives by the Department of Health and Human Services (HHS) and the White House have encouraged the release of increasing numbers of datasets containing individual records (microdata) collected from survey respondents, doctor and hospital visits, and medical claims. At the same time, federal agencies that release data collected from individuals and establishments have an obligation under the law to protect the confidentiality of those supplying the data as well as the information provided.2 The challenge faced by HHS and other federal agencies is to achieve an appropriate balance between providing the public with useful datasets and protecting the confidentiality of the individuals and establishments whose information is contained in the data. Addressing this challenge is made more difficult by what has been termed in other circles the “mosaic effect,” the idea that disparate pieces of information, though individually of limited utility, become significant when combined with other types of information (Pozen 2005). The concern is that the datasets being released in large numbers (more than 1,000 by HHS alone) provide the pieces of intelligence that, when assembled correctly, disclose information that the federal government is required to maintain as confidential.
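The mosaic effect can be illustrated with a minimal sketch. The two toy datasets below, their field names, and the helper `link_unique` are all hypothetical and are not drawn from any actual release; the sketch only assumes that a file stripped of names still carries a few ordinary attributes (here ZIP code, birth year, and sex) that also appear in some other, identified source.

```python
from collections import defaultdict

# Hypothetical "de-identified" release: names removed, other attributes kept.
health_release = [
    {"zip": "20001", "birth_year": 1950, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "20001", "birth_year": 1950, "sex": "M", "diagnosis": "asthma"},
    {"zip": "20002", "birth_year": 1982, "sex": "F", "diagnosis": "hypertension"},
    {"zip": "20002", "birth_year": 1982, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical identified source (for example, a public roster).
public_roster = [
    {"name": "Person A", "zip": "20001", "birth_year": 1950, "sex": "F"},
    {"name": "Person B", "zip": "20001", "birth_year": 1950, "sex": "M"},
    {"name": "Person C", "zip": "20002", "birth_year": 1982, "sex": "F"},
]

# Attributes shared by both files; assumed for illustration only.
SHARED_FIELDS = ("zip", "birth_year", "sex")


def key(record):
    """Return the combination of shared attributes for one record."""
    return tuple(record[field] for field in SHARED_FIELDS)


def link_unique(release, roster):
    """Pair roster names with release records when the shared attributes
    match exactly one record in each file."""
    release_by_key = defaultdict(list)
    for record in release:
        release_by_key[key(record)].append(record)

    roster_by_key = defaultdict(list)
    for person in roster:
        roster_by_key[key(person)].append(person)

    matches = []
    for k, people in roster_by_key.items():
        records = release_by_key.get(k, [])
        # Only a one-to-one match is treated as a candidate re-identification.
        if len(people) == 1 and len(records) == 1:
            matches.append((people[0]["name"], records[0]["diagnosis"]))
    return sorted(matches)


if __name__ == "__main__":
    for name, diagnosis in link_unique(health_release, public_roster):
        print(name, "->", diagnosis)
```

In this toy case neither file discloses a named person's diagnosis on its own, but the combination links two of the three named people to a single health record. The third person matches two records and remains ambiguous. A unique match within two small files does not by itself establish that the person is unique in the population, so a linkage of this kind indicates possible rather than certain re-identification.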
This paper reviews the issues that arise in protecting the confidentiality of data collected by the federal government. The literature discussed includes standard references on the topic from the past 10 years as well as additional articles and unpublished papers identified through searches of relevant journals, conference proceedings, and bibliographic sources. Key words searched included de-identification, re-identification, disclosure avoidance, public use files, confidentiality, disclosure risk, and mosaic effect.
Chapter II discusses the concept of disclosure, reviews some instances in which individuals have been re-identified in data released to the public (although not by the federal government), explores the sources of disclosure risk, and concludes with observations on the mosaic effect. Chapter III reviews approaches to protecting data against disclosure, comments on the legal environment, and discusses methods commonly used to assess disclosure risk in public use datasets.
There is an inherent contradiction between protecting data from disclosure and maximizing its value to users. Restricting access to the data or altering its contents so as to reduce the risk of disclosure also diminishes the data’s usefulness for research. Chapter IV discusses ways in which the utility of data is reduced by strategies to limit disclosure and, related to this, how the loss of information can be measured.
1 This background paper was prepared by John L. Czajka, Amang Sukasih, and Craig Schneider.
2 For a summary of recent documents explaining the open data policy and a review of relevant laws establishing the government’s obligation to protect the confidentiality of the data it collects, see Appendix D.