Minimizing Disclosure Risk in HHS Open Data Initiatives

A. Background


The Open Data Initiatives launched by the White House aim to make government information resources more accessible to the public in machine-readable form and to encourage entrepreneurs to use such data to create new products, services, and jobs. However, there is concern that some users of the datasets made publicly available through these initiatives will be able to re-identify individuals or firms whose information is contained in them. The challenge for federal agencies is to strike the appropriate balance between (1) providing the public with useful datasets and (2) protecting the privacy and confidentiality of individuals whose information is contained in these data.

The mosaic effect refers to the concept that as more micro datasets become available, including de-identified datasets, the risk of disclosure rises beyond the risk associated with any single dataset, because the information in the other datasets can be combined with it. Although there is great interest in the mosaic effect, ASPE had not yet been able to identify analyses of the issue, evidence of risk quantification, or principles and best practices for addressing it.
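The mosaic effect can be illustrated with a small, purely hypothetical sketch. The four records, the field names (`zip3`, `birth_year`, `sex`, `occupation`), and the two "releases" below are invented for illustration and do not come from any HHS dataset. The sketch also assumes that an attacker is able to link records across the two releases, which is itself a strong assumption. Under those assumptions, it counts how many records share each combination of values: every combination in either release alone is shared by at least two records, but once the fields of both releases are combined, every record becomes unique.

```python
from collections import Counter

# Hypothetical population; all values are made up for illustration.
population = [
    {"zip3": "207", "birth_year": 1980, "sex": "F", "occupation": "nurse"},
    {"zip3": "207", "birth_year": 1980, "sex": "M", "occupation": "teacher"},
    {"zip3": "208", "birth_year": 1975, "sex": "F", "occupation": "nurse"},
    {"zip3": "208", "birth_year": 1975, "sex": "M", "occupation": "teacher"},
]

# Fields exposed by two hypothetical, separately released datasets.
release_a_fields = ["zip3", "birth_year"]
release_b_fields = ["sex", "occupation"]
combined_fields = release_a_fields + release_b_fields


def group_sizes(records, fields):
    """Count how many records share each combination of values in `fields`."""
    return Counter(tuple(r[f] for f in fields) for r in records)


def smallest_group(records, fields):
    """Size of the smallest group of records that look identical on `fields`."""
    return min(group_sizes(records, fields).values())


def unique_records(records, fields):
    """Number of records whose combination of values on `fields` is unique."""
    return sum(1 for n in group_sizes(records, fields).values() if n == 1)


for label, fields in [
    ("release A alone", release_a_fields),
    ("release B alone", release_b_fields),
    ("A and B combined", combined_fields),
]:
    print(label, smallest_group(population, fields),
          unique_records(population, fields))
```

In this toy case each release on its own leaves every person indistinguishable from one other person, while the combination singles out all four. Real populations and real releases are far larger and messier, so the sketch shows only the direction of the effect, not its magnitude.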

The federal and HHS-specific Open Data Initiatives are releasing more data for public use. At the same time, there is a proliferation of data from other, non-federal sources, including social media. A key question is whether the risk of re-identification grows as more datasets become available. If it does, what principles and practices can agencies use to offset this increased risk of disclosure? Changing technology has also increased the potential threat of re-identification. Faster and less expensive data processing capabilities and sophisticated software make it much more feasible for nefarious actors to combine the information released in numerous datasets, and then use these data to try to determine individuals’ identities.

To counter these threats, statisticians and computer scientists have developed a large body of statistical disclosure avoidance techniques to minimize the risk of disclosure, and researchers continue to advance the state of the art. Federal agencies have implemented a variety of policies and procedures to protect the confidentiality of the data they release, and these have been highly effective in protecting against disclosures in individual datasets.[3] The TEP addressed whether the current techniques and data release procedures are sufficient to protect confidentiality in light of the mosaic effect and growing threats elsewhere, or whether new techniques and data release mechanisms are needed.[4] The meeting provided a self-assessment of potential new threats to federal data privacy protection and the agencies’ capacity to address them.

[3] Many of these policies are described in Statistical Policy Working Paper 22, produced by the Federal Committee on Statistical Methodology (FCSM) in 2005.

[4] Multiple disclosure avoidance techniques are discussed in Chapter III. The data release mechanisms include public use data files, de-identified data, data use agreements, and research data centers (RDCs).
