The purpose of this research project was to provide the Office of Science and Data Policy at ASPE with some informed observations concerning the use of new data sources and data management strategies in policy research, evaluation, and decision-making at the federal level. A secondary goal was to identify successful training models in data science
Final Report Volume I: Background Paper, Declining Response Rates in Federal Surveys: Trends and Implications
Over the last decade, survey response rates have been steadily declining, and this decline has raised concerns across the federal government regarding the quality and utility of national survey data. Response rates are commonly considered the most important indicator of the representativeness of a survey sample and overall data quality, and low re
HHS Action Plan to Reduce Racial and Ethnic Health Disparities: Implementation Progress Report 2011-2014
The U.S. Department of Health and Human Services Action Plan to Reduce Racial and Ethnic Health Disparities (HHS Disparities Action Plan) is the most comprehensive federal commitment to date for reducing, and eventually eliminating, disparities in health and health care.
Federal agencies have a long history of releasing data to the public, and they also have a legal obligation to protect the confidentiality of the individuals and organizations from which the data were collected. Federal agencies have successfully balanced these two objectives for decades.
Alexander, J. Trent, Michael Davern, and Betsey Stevenson. “Inaccurate Age and Sex Data in the Census PUMS Files: Evidence and Implications.” Public Opinion Quarterly, vol. 74, no. 3, Fall 2010, pp. 551-569.
Barth-Jones, Daniel C. “The ‘Re-identification’ of Governor William Weld’s Medical Information: A Critical Re-examination of Hea
If a dataset that is protected against disclosure can be described as one in which a user cannot determine anything about a given individual from the dataset that could not be determined without the dataset, then by the same principle, preserving utility could be described in the following terms: the data should be no less useful to the users than
A critical element in preparing a public use file of microdata is the assessment of disclosure risk, which may involve estimating the probability of re-identification. Often this is an iterative process, in which a preliminary file is tested and if the risk is determined to be too high, additional protective measures are applied. For the MASSC met
A 2005 National Academy of Sciences (NAS) panel report on expanding access to research data notes that “at present, the obligation to protect individual respondents falls primarily on those who collect the data, thereby creating a disincentive for providing access to other researchers” (National Research Council 2005).
Preventing re-identification is the primary focus of de-identification and the statistical disclosure limitation methods discussed in this section, but a secondary objective (or result, if not necessarily an objective) is to limit what an intruder might learn from an apparent re-identification. Techniques that alter data values or exchange values
Sweeney was able to re-identify a large proportion of the individuals in the Massachusetts hospital discharge data because the combination of ZIP code, sex, and date of birth in many cases pointed to unique individuals in the city of Cambridge voter registration records, which provided their names.
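The linkage described above can be illustrated with a minimal sketch. It assumes two toy files that share ZIP code, sex, and date of birth; every record, name, and function here is hypothetical and invented for illustration, and the code is not a reconstruction of Sweeney's actual procedure. It treats a record as a re-identification candidate only when its combination of the three fields is unique in both files.

```python
from collections import Counter

# Hypothetical toy data: all values below are invented for illustration.
discharge = [  # "de-identified" file: no names, but quasi-identifiers remain
    {"zip": "02138", "sex": "F", "dob": "1950-03-14", "diagnosis": "A"},
    {"zip": "02138", "sex": "M", "dob": "1945-07-31", "diagnosis": "B"},
    {"zip": "02139", "sex": "F", "dob": "1962-11-02", "diagnosis": "C"},
    {"zip": "02139", "sex": "F", "dob": "1962-11-02", "diagnosis": "D"},
]
voters = [  # identified file: names plus the same three fields
    {"name": "Voter One", "zip": "02138", "sex": "F", "dob": "1950-03-14"},
    {"name": "Voter Two", "zip": "02138", "sex": "M", "dob": "1945-07-31"},
    {"name": "Voter Three", "zip": "02139", "sex": "F", "dob": "1962-11-02"},
    {"name": "Voter Four", "zip": "02140", "sex": "M", "dob": "1970-01-01"},
]

QUASI_IDENTIFIERS = ("zip", "sex", "dob")


def key(record):
    """Return the record's combination of quasi-identifier values."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def unique_keys(records):
    """Return the combinations that occur exactly once in the file."""
    counts = Counter(key(r) for r in records)
    return {k for k, n in counts.items() if n == 1}


def link_unique(deidentified, identified):
    """Pair names with diagnoses where the combination is unique in both files."""
    shared = unique_keys(deidentified) & unique_keys(identified)
    names = {key(r): r["name"] for r in identified if key(r) in shared}
    return [(names[key(r)], r["diagnosis"]) for r in deidentified if key(r) in shared]


matches = link_unique(discharge, voters)
for name, diagnosis in matches:
    print(name, diagnosis)
```

In this toy example, the two discharge records that share a ZIP code, sex, and date of birth are not linked, because the combination does not point to a single record. This is the intuition behind measures that coarsen or suppress such fields before release.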
HIPAA codified a de-identification process for health records that includes the removal of 18 specific direct and indirect identifiers. Following Sweeney’s successful re-identification of the Massachusetts governor in a file of hospital discharge data, the protections mandated by HIPAA went well beyond the simpler, informal de-identification
Public use microdata play a critical role in research and policy analysis. Exploratory research and many types of policy analysis do not lend themselves well to the conditions that govern restricted access as described above. The creation of public use data that protect the confidentiality of the subjects begins with de-identification, but dependi
A number of federal agencies allow users remote access to agency data that are not released on public use files. This can take a number of different forms. For example, the Census Bureau allows users to request tabulations from decennial census files that include more detail than the numerous tabulations that can be obtained from the bureau websit
Several federal agencies maintain RDCs, in which approved users can access agency data that are not released to the public. The data never leave the site, and output produced from data held in the RDC cannot be removed without a disclosure review, which can take different forms. For example, RDC staff may be authorized to review output, or the out
Under licensing arrangements, prospective users request restricted (that is, non-public) data files through a formal application process. To obtain such data, users must demonstrate that the data will be stored and used in a secure environment that meets the issuing agency’s standards. This may require an initial agency inspection and a willingn
Federal agencies use three basic mechanisms to provide researchers with restricted access to data that are not released to the public: licensing, research data centers (RDCs), and secure remote access. Each is discussed in turn.
Two general approaches are used to release microdata in a way that protects the data from disclosure: restricting access to the data, and restricting the data that are released for public access (National Research Council 2005). The latter approach encompasses a wide range of techniques that include suppres
The “mosaic effect” is a new term in the literature on confidentiality. It received prominent mention in Memorandum M-13-13 from the Office of Management and Budget (OMB), “Open Data Policy—Managing Information as an Asset” (OMB 2013), but a search for the term in the database Google Scholar produced no relevant hits.
When two files contain some of the same individuals, the records common to the two files can be linked if the two files also contain some of the same variables. When the two files contain unique and valid numeric identifiers, the records can be linked using “exact matching” on those fields—as is commonly done when files contain Social Securi