On May 9, 2013, President Obama issued the executive order, “Making Open and Machine Readable the New Default for Government Information,” in which he directed the Office of Management and Budget (OMB) to issue an Open Data Policy throughout the federal government. The objectives of this executive order were to advance the management of government information as an asset throughout its life cycle; to promote interoperability and openness; and, whenever possible and legally permissible, to ensure that data are released to the public in ways that make the data easy to find, accessible, and usable. Federal agencies have a long history of releasing data to the public, and they also have a legal obligation to protect the confidentiality of the individuals and organizations from which the data were collected. Federal agencies have successfully balanced these two objectives for decades. With the new emphasis on expanding public access to federal data, coupled with the increasing availability of data from other sources, federal agencies are continuing to ensure that the combination of data already available and the data they are preparing to release does not enable the identification of individuals or other entities through what has been termed the “mosaic effect.” 1
To gain more insight into the mosaic effect and its implications for the continued release of data to the public while minimizing the risk of disclosing personal information, the Office of the Assistant Secretary for Planning and Evaluation (ASPE) in the U.S. Department of Health and Human Services (HHS) contracted with Mathematica Policy Research to convene a technical expert panel (TEP), prepare background materials, and summarize what was learned from the panel discussion and the background research in a final report. 2 The goals of the project were (1) a balanced and scientifically sound assessment of the mosaic effect, (2) identification of any unique increased risk associated with the mosaic effect, and (3) identification of data release policies and best practices that can prevent or reduce disclosure due to the mosaic effect.
1 The concept of a mosaic effect is derived from the mosaic theory of intelligence gathering, in which disparate pieces of information—though individually of limited utility—become significant when combined with other types of information (Pozen 2005).
2 More specifically, the project’s components are: (1) a pair of background papers, one reviewing federal policies and procedures regarding the use and protection of personal data and the other an environmental scan of literature relevant to releasing federal microdata in light of the risks presented by the mosaic effect; (2) a TEP tasked with addressing the mosaic effect through a discussion of best practices in protecting confidentiality in open data initiatives; and (3) this report, which synthesizes the findings from the background papers and the proceedings of the TEP meeting. The TEP meeting was held on June 27, 2014. The meeting agenda is reproduced in Appendix A, and a list of attendees is included in Appendix B. Minutes from the TEP meeting are presented in Appendix C. The two background papers are included in Appendices D and E.
The Open Data Initiatives launched by the White House aim to make government information resources more accessible to the public in machine-readable form and to encourage entrepreneurs to use such data to create new products, services, and jobs. However, there is concern that certain users of the datasets made publicly available via these open data initiatives will be able to re-identify individuals or firms whose information is contained in these datasets. The challenge faced by federal agencies is to achieve the appropriate balance between (1) providing the public with useful datasets, and (2) protecting the privacy and confidentiality of individuals whose information is contained in these data.
The mosaic effect refers to the concept that the availability of a growing number of microdata sets, including de-identified datasets, increases the risk of disclosure beyond the risk associated with any particular dataset because of the totality of the information in the other datasets. Although there is great interest in the mosaic effect, ASPE had not yet been able to identify analyses of the issue, evidence of risk quantification, or principles and best practices for addressing it.
The federal and HHS-specific Open Data Initiatives are releasing more data for public use. At the same time, there is a proliferation of data from other, non-federal sources, including social media. A key question is whether the risk of re-identification grows as more datasets become available. If it does, what principles and practices can agencies use to offset this increased risk of disclosure? Changing technology has also increased the potential threat of re-identification. Faster and less expensive data processing capabilities and sophisticated software make it much more feasible for nefarious actors to combine the information released in numerous datasets, and then use these data to try to determine individuals’ identities.
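The mechanism described above can be illustrated with a minimal sketch. The records, field names (`zip3`, `sex`, `birth_year`, `condition`), and the two outside sources are all hypothetical and invented for illustration; they do not come from any federal dataset or from the background papers. The sketch assumes an adversary who learns a few attributes about a target from separate outside sources and then looks for matching records in a de-identified public file.

```python
def matching_records(released, known):
    """Return the released records consistent with every attribute in `known`."""
    return [r for r in released if all(r.get(k) == v for k, v in known.items())]

# Hypothetical de-identified public use file (no names or direct identifiers).
released = [
    {"id": 1, "zip3": "208", "sex": "F", "birth_year": 1950, "condition": "asthma"},
    {"id": 2, "zip3": "208", "sex": "F", "birth_year": 1962, "condition": "diabetes"},
    {"id": 3, "zip3": "208", "sex": "F", "birth_year": 1975, "condition": "none"},
    {"id": 4, "zip3": "208", "sex": "M", "birth_year": 1962, "condition": "asthma"},
    {"id": 5, "zip3": "301", "sex": "F", "birth_year": 1962, "condition": "none"},
]

# What the adversary is assumed to know about one target from two outside sources.
from_source_one = {"zip3": "208", "sex": "F"}   # e.g., a public listing (assumption)
from_source_two = {"birth_year": 1962}          # e.g., a social media profile (assumption)

alone_one = matching_records(released, from_source_one)
alone_two = matching_records(released, from_source_two)
combined = matching_records(released, {**from_source_one, **from_source_two})

# Each source alone leaves several candidates; together they isolate one record.
print(len(alone_one), len(alone_two), len(combined))
```

In this toy file, neither outside source alone singles out a record, but the combination does, and the sensitive `condition` value for that record is then exposed. Real linkage attempts face measurement error, incomplete coverage, and the disclosure avoidance methods applied by agencies, so this sketch shows only the logic of the concern, not a realistic estimate of risk.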
To counter these threats, statisticians and computer scientists have developed a large body of statistical disclosure avoidance techniques to minimize the risk of disclosure, and researchers continue to advance the state of the art. Federal agencies have implemented a variety of policies and procedures to protect the confidentiality of the data they release, and these have been highly effective in protecting against disclosures in individual datasets. 3 The TEP addressed the question of whether the current techniques and data release procedures are sufficient to protect confidentiality in light of the mosaic effect and growing threats elsewhere, or whether new techniques and data release mechanisms are needed. 4 The meeting provided a self-assessment regarding potential new threats to federal data privacy protection and the agencies’ capacity to address them.
3 Many of these policies are described in Statistical Policy Working Paper 22 produced by the Federal Committee on Statistical Methodology (FCSM) in 2005.
4 Multiple disclosure avoidance techniques are discussed in Chapter III. The data release mechanisms include public use data files, de-identified data, data use agreements, and research data centers (RDCs).
B. Organization of the Report
The concerns discussed above—new technologies, increasing amounts of data being made available to the public, growing numbers of other data sources, and the tools available to determined adversaries—provide a compelling motivation for federal agencies to continuously re-assess the risks of disclosure due to the mosaic effect. ASPE established this project to promote greater sharing of information about methods, data sources, and how to minimize disclosure risk among federal agencies in order to benefit the government and the public. The communication of best practices, lessons learned, and the state of the art in de-identification and re-identification methodologies should be useful to federal officials and others who make data publicly available and are simultaneously responsible for ensuring the privacy of respondents and the confidentiality of these data files.
This report summarizes the principal findings from the project. Chapter II presents a summary of federal legislation and regulations regarding the release and protection of personal data along with recent policy statements with respect to open data. The chapter is based on material presented in the first background paper (Appendix D). Chapter III summarizes federal procedures for providing the public with access to government data while preserving the confidentiality of the individuals and businesses from whom the data were collected. This chapter draws on material presented in both background papers and one of the TEP sessions. Chapter IV reviews key issues in protecting public use microdata from disclosure. The chapter draws on the second background paper (Appendix E). This is followed in Chapter V by a summary of the experts’ views expressed during two panel discussions at the TEP meeting. Chapter VI synthesizes the key findings from the project.