Data collected by the federal government are covered by a substantial array of regulations intended to protect the confidentiality of the data and the privacy of those from whom such data are obtained. While remaining attentive to these requirements, federal agencies have been able and willing to provide researchers across a wide range of discipli
The Open Data Initiative launched by the Executive Office of the President and OMB has encouraged the release of increasing numbers of datasets containing individual records (microdata) collected or sponsored by federal agencies from survey respondents, doctor and hospital visits, and medical claims. At the same time, federal agencies that release
Johnson noted that the IRS considers even revealing that a person filed a tax return to be a disclosure, so the agency sets a high bar to prevent disclosure. The IRS balances transparency and confidentiality by working in cooperation with users. It formed a user group and asks its members to help make choices. Two outside users helped develop the updated ver
Mark Asiala explained that public use files that include microdata are only one part of a “suite” of data types released by the Census Bureau. Other types include tables produced from aggregated data for low levels of geography, special tabulations, and research papers. The potential threats that the Census Bureau faces include an ability
Minimizing Disclosure Risk in HHS Open Data Initiatives. B. Good Practices for Protecting Public Use Data
Panelists in this fourth session included: Mark Asiala, Census Bureau; Barry Johnson, Statistics of Income Division, Internal Revenue Service; Allison Oelschlaeger, Centers for Medicare & Medicaid Services; Eve Powell-Griner, National Center for Health Statistics; Fritz Scheuren, NORC at the University of Chicago; Connie Citro, Committee on Nati
Moderator Steve Cohen identified the following themes during these presentations: game theory, data resources that allow for a breach, and how to simulate the threat of disclosure. He asked the panelists to address where we are heading in the next five years to address these threats.
Khaled El Emam noted that de-identification has been simplified through automation. The process of de-identification in practice involves assessing risk, classifying the variables in the file, and mapping the data. These contribute to specifications in an automated anonymization engine through which the original data are run to produce the anony
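The steps named above (classifying the variables in the file and assessing risk) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and does not come from the presentation: the variable names, the three classes (direct identifier, quasi-identifier, sensitive), the toy records, and the choice of "share of records unique on the quasi-identifiers" as the risk measure. It is not the anonymization engine described by the panelist.

```python
from collections import Counter

# Hypothetical classification of the variables in a file (assumed for illustration).
VARIABLE_CLASSES = {
    "name": "direct_identifier",
    "zip3": "quasi_identifier",
    "birth_year": "quasi_identifier",
    "sex": "quasi_identifier",
    "diagnosis": "sensitive",
}


def quasi_identifiers(classes):
    """Return the variables classified as quasi-identifiers, in sorted order."""
    return sorted(v for v, c in classes.items() if c == "quasi_identifier")


def drop_direct_identifiers(records, classes):
    """Remove every variable classified as a direct identifier."""
    return [
        {k: v for k, v in r.items() if classes.get(k) != "direct_identifier"}
        for r in records
    ]


def share_unique(records, keys):
    """Share of records whose combination of `keys` appears exactly once."""
    if not records:
        return 0.0
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(records)


# Toy records, invented for this example.
records = [
    {"name": "Ann", "zip3": "372", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"name": "Bob", "zip3": "372", "birth_year": 1980, "sex": "M", "diagnosis": "diabetes"},
    {"name": "Cara", "zip3": "372", "birth_year": 1980, "sex": "F", "diagnosis": "flu"},
    {"name": "Dan", "zip3": "100", "birth_year": 1955, "sex": "M", "diagnosis": "asthma"},
]

released = drop_direct_identifiers(records, VARIABLE_CLASSES)
risk = share_unique(released, quasi_identifiers(VARIABLE_CLASSES))
print(f"Share of records unique on quasi-identifiers: {risk:.2f}")
```

Removing direct identifiers alone leaves half of these toy records unique on the remaining quasi-identifiers, which is the kind of residual risk a risk assessment step would need to flag before release.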
A. What Are the Re-identification Threats to Releasing Federal Data to the Public?
Panelists in this third session on the day’s agenda included: Khaled El Emam, University of Ottawa and Privacy Analytics; Brad Malin, Vanderbilt University; Latanya Sweeney, Federal Trade Commission and Harvard University; Denise Love, National Association of Health Data Organizations (NAHDO); Daniel Barth-Jones, Columbia University; Steve Cohen,
The technical expert panel (TEP) held on June 27, 2014, included two panel discussions: "What Are the Re-identification Threats to Releasing Federal Data to the Public?" and "Good Practices for Protecting Public Use Data" (see Appendix C for a detailed summary of the TEP presentations and discussions). In this chapter we summarize highlights of the
B. Maintaining the Utility of Public Use Data
Steps taken to preserve the confidentiality of public use data have an adverse effect on the quality of the data and their general usefulness for research. Purdam and Elliot (2007) classify the impact of statistical disclosure limitation on data utility into two categories: (1) reduction of analytical completeness and (2) loss of analytical validity
A useful way to view disclosure risk was expressed by Marsh et al. (1991): the probability of disclosure is the product of two terms: (1) the probability of a successful re-identification conditional on someone trying to re-identify a record and (2) the probability that someone will try to re-identify a record. A data producer can lower the risk o
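The two-term decomposition attributed to Marsh et al. (1991) can be written out as a small calculation. The sketch below is illustrative only: the function name and all of the probability values are hypothetical and are not taken from the source or from Marsh et al.

```python
def disclosure_probability(p_success_given_attempt: float, p_attempt: float) -> float:
    """P(disclosure) = P(success | attempt) * P(attempt)."""
    for name, p in (
        ("p_success_given_attempt", p_success_given_attempt),
        ("p_attempt", p_attempt),
    ):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {p}")
    return p_success_given_attempt * p_attempt


# Hypothetical values, chosen only to show how each term moves the product.
baseline = disclosure_probability(0.10, 0.05)

# Reducing the first term: re-identification is harder once someone tries.
harder_to_reidentify = disclosure_probability(0.02, 0.05)

# Reducing the second term: fewer people try in the first place.
fewer_attempts = disclosure_probability(0.10, 0.01)

print(baseline, harder_to_reidentify, fewer_attempts)
```

Because the risk is a product, lowering either term by a given factor lowers the overall probability by the same factor, so the two kinds of safeguards can substitute for or reinforce each other.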
As we reported in Chapter II, there are extensive federal regulations designed to protect the confidentiality of the individuals and organizations whose private information is reported in federal databases. The laws regulating the sharing of federal data place much more responsibility upon the data producer than the user. Many of these laws specif
Understanding the potential sources of disclosure risk requires an awareness of who might attempt to re-identify records in federal microdata, what their capabilities are, and what resources they have, including the data they might use in their re-identification attempts and the tools available to assist them. a. Potential In
1. Re-identification of Individuals in Data Released to the Public
There are exceedingly few documented instances of the re-identification of individual persons in datasets that have been released to the public. None has involved a sample survey or a federal government database, and few have involved data that were protected by methods that would be considered rigorous by today’s standards.
In a seminal paper on protecting the confidentiality of data released to the public, Dalenius (1977) described the problem in the following terms: “access to a statistical database should not enable one to learn anything about an individual that could not be learned without access” (cited in Dwork and Naor 2010). The literature on disclosu
IV. Issues in Protecting Microdata From Disclosure
In the previous chapter we reviewed methods that federal agencies and other organizations use to protect the confidentiality of the data they release. In this chapter we examine what the literature tells us about disclosure risk and about an important side-effect of protecting data from disclosure: a reduction in the data’s usefulness for resear
Much of the recent research on protecting microdata has focused on how the usefulness of the data is affected when methods of statistical disclosure limitation are applied. This topic is addressed in the next chapter. Research on ways to improve the protection afforded to public use microdata has addressed ways to enhance existing approaches rathe
B. Methods of Statistical Disclosure Limitation
Statistical disclosure avoidance techniques for microdata have been well developed and widely published in journals, textbooks, and workshop and conference proceedings. The following two sources provide comprehensive accounts of these techniques: (1) Statistical Policy Working Paper 22 (FCSM 2005); and (2) Handbook on Statistical Disclosure Contro
Public use microdata play a critical role in research and policy analysis. Exploratory research and many types of policy analysis do not lend themselves well to the conditions that govern restricted access as described above. The creation of public use data that protect the confidentiality of the subjects begins with de-identification, but dependi
There are three basic mechanisms that federal agencies use to provide researchers with restricted access to data that are not released to the public. These include licensing, research data centers (RDCs), and secure remote access.