If a dataset that is protected against disclosure can be described as one in which a user cannot determine anything about a given individual from the dataset that could not be determined without the dataset, then by the same principle, preserving utility could be described in the following terms: the data should be no less useful to the users than
A critical element in preparing a public use file of microdata is the assessment of disclosure risk, which may involve estimating the probability of re-identification. Often this is an iterative process, in which a preliminary file is tested and if the risk is determined to be too high, additional protective measures are applied. For the MASSC met
A 2005 National Academy of Sciences (NAS) panel report on expanding access to research data notes that “at present, the obligation to protect individual respondents falls primarily on those who collect the data, thereby creating a disincentive for providing access to other researchers” (National Research Council 2005).
Preventing re-identification is the primary focus of de-identification and the statistical disclosure limitation methods discussed in this section, but a secondary objective (or result, if not necessarily an objective) is to limit what an intruder might learn from an apparent re-identification. Techniques that alter data values or exchange values
Sweeney was able to re-identify a large proportion of the individuals in the Massachusetts hospital discharge data because the combination of ZIP code, sex, and date of birth in many cases pointed to unique individuals in the city of Cambridge voter registration records, which provided their names.
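The linkage described above can be illustrated with a minimal sketch. It assumes two small, entirely made-up files: a "de-identified" discharge file with no names and an identified voter list. All records, field names, and the `link_unique` helper are hypothetical, and this is not a reconstruction of the procedure Sweeney actually used. The sketch links a released record to a name only when its combination of ZIP code, sex, and date of birth is unique in both files.

```python
from collections import Counter

# Hypothetical quasi-identifiers shared by both files.
QUASI_IDENTIFIERS = ("zip", "sex", "dob")

# Made-up "de-identified" release: no names, but quasi-identifiers remain.
discharge = [
    {"zip": "02138", "sex": "M", "dob": "1950-01-01", "diagnosis": "A"},
    {"zip": "02138", "sex": "F", "dob": "1962-03-14", "diagnosis": "B"},
    {"zip": "02139", "sex": "F", "dob": "1962-03-14", "diagnosis": "C"},
    {"zip": "02139", "sex": "F", "dob": "1962-03-14", "diagnosis": "D"},
]

# Made-up identified auxiliary file (for example, a voter list).
voters = [
    {"name": "Voter One", "zip": "02138", "sex": "M", "dob": "1950-01-01"},
    {"name": "Voter Two", "zip": "02138", "sex": "F", "dob": "1962-03-14"},
    {"name": "Voter Three", "zip": "02139", "sex": "F", "dob": "1962-03-14"},
]


def key(record):
    """Return the record's combination of quasi-identifier values."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link_unique(released, auxiliary):
    """Pair released records with names when the key is unique in both files."""
    released_counts = Counter(key(r) for r in released)
    aux_counts = Counter(key(a) for a in auxiliary)
    # Keep only auxiliary records whose key points to exactly one person.
    names = {key(a): a["name"] for a in auxiliary if aux_counts[key(a)] == 1}
    matches = []
    for r in released:
        k = key(r)
        if released_counts[k] == 1 and k in names:
            matches.append((names[k], r["diagnosis"]))
    return matches


matches = link_unique(discharge, voters)
for name, diagnosis in matches:
    print(name, diagnosis)
```

In this toy example the two released records that share the same ZIP code, sex, and date of birth are not linked, because the combination does not point to a single record. This is the intuition behind coarsening or suppressing such fields before release.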
HIPAA codified a de-identification process for health records that includes the removal of 18 specific direct and indirect identifiers. Following Sweeney’s successful re-identification of the Massachusetts governor in a file of hospital discharge data, the protections mandated by HIPAA went well beyond the simpler, informal de-identification
Public use microdata play a critical role in research and policy analysis. Exploratory research and many types of policy analysis do not lend themselves well to the conditions that govern restricted access as described above. The creation of public use data that protect the confidentiality of the subjects begins with de-identification, but dependi
A number of federal agencies allow users remote access to agency data that are not released on public use files. This can take a number of different forms. For example, the Census Bureau allows users to request tabulations from decennial census files that include more detail than the numerous tabulations that can be obtained from the bureau websit
Several federal agencies maintain RDCs, in which approved users can access agency data that are not released to the public. The data never leave the site, and output produced from data held in the RDC cannot be removed without a disclosure review, which can take different forms. For example, RDC staff may be authorized to review output, or the out
Under licensing arrangements, prospective users request restricted (that is, non-public) data files through a formal application process. To obtain such data, users must demonstrate that the data will be stored and used in a secure environment that meets the issuing agency’s standards. This may require an initial agency inspection and a willingn
Federal agencies use three basic mechanisms to provide researchers with restricted access to data that are not released to the public: licensing, research data centers (RDCs), and secure remote access. These are discussed in turn below.
There are two general approaches that are used to release microdata in a way that protects the data from disclosure. One is by restricting access to the data, and the other is by restricting the data that are released for public access (National Research Council 2005). The latter approach encompasses a wide range of techniques that include suppres
The “mosaic effect” is a new term in the literature on confidentiality. It received prominent mention in Memorandum M-13-13 from the Office of Management and Budget (OMB), “Open Data Policy—Managing Information as an Asset” (OMB 2013), but a search for the term in the database Google Scholar produced no relevant hits.
When two files contain some of the same individuals, the records common to the two files can be linked if the two files also contain some of the same variables. When the two files contain unique and valid numeric identifiers, the records can be linked using “exact matching” on those fields—as is commonly done when files contain Social Securi
Following Dwork and Naor (2010), the data that a potential intruder would use to re-identify records on a public use file may be described as auxiliary data. The challenge in protecting a public use file from any possibility of re-identification is the inability to guarantee that there are no auxiliary data in anyone’s possession that would enab
Potential intruders—those who might attempt to re-identify entities in the data and use the information in some way—encompass a wide range of possible users. Among these the greatest threat is posed by those with exceptional computer skills or with access to information on a large number of identified individuals or with exceptionally detailed
To understand the potential sources of disclosure risk, we need to know who the potential intruders are (that is, those who might attempt to re-identify records in federal microdata), what their capabilities are, what data they might use in their re-identification attempts, and what tools are available to assist them.
Minimizing Disclosure Risk in HHS Open Data Initiatives
A. Re-identification of Individuals in Data Released to the Public
There are exceedingly few documented instances of the re-identification of individual persons in datasets that have been released to the public. None has involved a sample survey or a federal government database, and few have involved data that were protected by methods that would be considered rigorous by today’s standards.
In a seminal paper on protecting the confidentiality of data released to the public, Dalenius (1977) described the problem in the following terms: “access to a statistical database should not enable one to learn anything about an individual that could not be learned without access” (cited in Dwork and Naor 2010). The literature on disclosu
Recent open data initiatives by the Department of Health and Human Services (HHS) and the White House have encouraged the release of increasing numbers of datasets containing individual records (microdata) collected from survey respondents, doctor and hospital visits, and medical claims.