Minimizing Disclosure Risk in HHS Open Data Initiatives. F. Methods of Disclosure Limitation Used by Federal Agencies


Table D.1 summarizes the methods of disclosure limitation used by federal statistical agencies at the time of the CDAC 2005 update of Statistical Policy Working Paper 22. Mathematica surveyed representatives of 14 agencies to ask whether each agency had changed its procedures since then. The results are reported in the final column of the table.

Table D.1. Summary of Agency Practices for Protecting Public Use Microdata as Reported in Statistical Policy Working Paper 22 (2005), with Updates

Agency Public use microdata and who reviews Restricted access allowed for researchers Statistical disclosure limitation methods for public use microdata Any update?
Energy Information Administration (EIA) Yes – office review No EIA does not have a standard for statistical disclosure limitation techniques for microdata files. The only microdata files for confidential data released by EIA are for the Residential Energy Consumption Survey (RECS) and the Commercial Buildings Energy Consumption Survey (CBECS). In these files, various standard statistical disclosure limitation procedures are used to protect the confidentiality of the data for individual households and buildings. These procedures include: eliminating identifiers, limiting geographic detail, omitting or collapsing data items, top-coding, bottom-coding, interval-coding, rounding, substituting weighted average numbers (blurring), and introducing noise through a data adjustment method that randomly adjusts respondent-level data within a controlled maximum percentage level around the actual published estimate. After applying the randomized adjustment method to the data, the mean values for broad population groups based on the adjusted data are the same as the mean values generated from the unadjusted data. No updates. EIA still applies the same methodologies for protecting the CBECS and RECS public use files.
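The EIA row describes noise that stays within a maximum percentage of each value while leaving group means unchanged. The sketch below is one way such a property could be obtained, not EIA's documented procedure: the function name `adjust_group`, the uniform noise distribution, and the rescaling step are all assumptions made for illustration.

```python
import random

def adjust_group(values, max_pct, rng):
    """Perturb each value by at most max_pct of itself while keeping the
    group total, and therefore the group mean, unchanged.
    Hypothetical helper; not an agency procedure."""
    # Draw a bounded relative adjustment for every record.
    deltas = [v * rng.uniform(-max_pct, max_pct) for v in values]
    increases = sum(d for d in deltas if d > 0)
    decreases = -sum(d for d in deltas if d < 0)
    if increases == 0 or decreases == 0:
        # Nothing to balance against; leave the group untouched.
        return list(values)
    # Shrink the larger side so increases and decreases cancel exactly.
    # Shrinking never pushes an adjustment past the max_pct bound.
    if increases > decreases:
        deltas = [d * decreases / increases if d > 0 else d for d in deltas]
    else:
        deltas = [d * increases / decreases if d < 0 else d for d in deltas]
    return [v + d for v, d in zip(values, deltas)]

# Example: kWh-style values for one broad population group (made-up numbers).
group = [100.0, 250.0, 400.0, 80.0, 900.0]
adjusted = adjust_group(group, 0.05, random.Random(0))
```

Rescaling only the larger side of the adjustments is a design choice: it balances the total without ever enlarging an individual adjustment, so the percentage cap still holds for every record.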
National Science Foundation (NSF) Yes – Meet or exceed Census public use products that are merged Yes When releasing public-use microdata files, individual identifiers are removed from all records and other high risk variables that contain distinguishing characteristics are modified to prevent identification of survey respondents and their responses. Top codes and bottom codes are employed for numeric fields to avoid showing extreme field values on a data record. Values beyond the top code or bottom code are replaced either by the average of the values in excess of the respective top code or bottom code or through the application of various imputation methodologies. No updates; 2005 description remains accurate.
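The NSF row describes replacing values beyond a top or bottom code with the average of the values beyond that code. A minimal sketch of that replacement rule follows; the function name `top_bottom_code` and the sample values are illustrative assumptions, and the imputation alternative mentioned in the row is not shown.

```python
def top_bottom_code(values, bottom, top):
    """Replace values above `top` with the mean of all values above `top`,
    and values below `bottom` with the mean of all values below `bottom`.
    Hypothetical helper for illustration only."""
    above = [v for v in values if v > top]
    below = [v for v in values if v < bottom]
    top_replacement = sum(above) / len(above) if above else None
    bottom_replacement = sum(below) / len(below) if below else None
    coded = []
    for v in values:
        if v > top:
            coded.append(top_replacement)
        elif v < bottom:
            coded.append(bottom_replacement)
        else:
            coded.append(v)
    return coded

# Made-up salary-like values in thousands.
salaries = [5, 20, 30, 40, 200, 400]
coded = top_bottom_code(salaries, bottom=10, top=100)
```

Because each replaced value is the mean of the values it stands in for, the total of the field is unchanged, which is one reason mean replacement is often preferred to simply capping values at the code.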
U.S. Census Bureau Yes -- Disclosure Review Board Yes

“Microdata cannot show geography below a population of 100,000. For the most detailed microdata, that threshold is raised to 250,000 or higher.” “For small populations or rare characteristics noise may be added to identifying variables, data may be swapped, or an imputation applied to the characteristic. Census data, which lacks the component of protection provided by sampling, employs targeted swapping in addition to the combination of table design and thresholds described above.”

Hawala, Zayatz, and Rowland (2004): “To insure that any data tabulation requested by external users will not disclose respondents’ identities, the U.S. Census Bureau uses data recoding and data swapping (Zayatz 2003).”

Zayatz (2005): “There are several disclosure avoidance techniques that we are currently using for our microdata files including geographic thresholds, rounding, noise addition, categorical thresholds, topcoding, and data swapping.”

Subsampling (only a fraction of the full microdata file from the survey/census is released) was used for the Decennial long form prior to ACS and is now used for ACS releases.

Synthetic data use is limited to the production of partially synthetic estimates for certain, small, specialized subpopulations. These subpopulations comprise only a small subset of the microdata files released. Synthetic data, in its various forms, is not widely used to protect Census microdata files.
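Data swapping is named several times in the Census Bureau row but not specified. The sketch below shows the general idea only: records that agree on a set of key variables exchange their geography, so marginal counts are preserved while the link between a record and its location is weakened. The function name `swap_geography`, the random pairing rule, and the field names are assumptions; the Bureau's targeted swapping selects records differently.

```python
import random
from collections import defaultdict

def swap_geography(records, key_fields, geo_field, rate, rng):
    """Swap `geo_field` between randomly paired records that agree on
    `key_fields`. Returns new records; the input is not modified.
    Hypothetical helper for illustration only."""
    swapped = [dict(r) for r in records]
    groups = defaultdict(list)
    for i, r in enumerate(swapped):
        groups[tuple(r[k] for k in key_fields)].append(i)
    for indexes in groups.values():
        rng.shuffle(indexes)
        # Pair neighbours in the shuffled order; an odd record out is left alone.
        for a, b in zip(indexes[0::2], indexes[1::2]):
            if rng.random() < rate:
                swapped[a][geo_field], swapped[b][geo_field] = (
                    swapped[b][geo_field],
                    swapped[a][geo_field],
                )
    return swapped

# Made-up records: pairs must match on age group and sex before swapping county.
people = [
    {"age_group": "30-39", "sex": "F", "county": "A"},
    {"age_group": "30-39", "sex": "F", "county": "B"},
    {"age_group": "60-69", "sex": "M", "county": "A"},
    {"age_group": "60-69", "sex": "M", "county": "C"},
    {"age_group": "60-69", "sex": "M", "county": "B"},
]
released = swap_geography(people, ["age_group", "sex"], "county", 0.5, random.Random(1))
```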

Bureau of Labor Statistics (BLS) BOC Collects Title 13 Yes BLS releases very few public use microdata files. Most of these microdata files contain data collected by the Census Bureau under an interagency agreement and Census's Title 13 authority. For these surveys (Current Population Survey, Consumer Expenditure Survey) the Census Bureau determines the statistical disclosure limitation procedures that are used. BLS releases public-use data files from three surveys in the family of the National Longitudinal surveys. Disclosure limitation methods used for the public use microdata files containing data from the National Longitudinal Survey of Youth, collected under contract by Ohio State University and Research Center at the University of Chicago, are similar to those used by the Census Bureau. No update, but 2005 description has been edited.
National Center for Education Statistics (NCES) Yes -- Disclosure Review Board Yes

All direct individually identifiable information (for example, school name, individual name, addresses) is stripped from the public use file. Continuous variables are top and bottom coded to protect against identification of outliers. After this has been done, a casual data intruder might identify an individual respondent by first identifying the sampled institution for the individual. To prevent identification of the sampled institution, all known publicly available lists of education institutions that contain institutions’ names and addresses are gathered. Each list is matched with the sample file using all common variables between the two files. If an institution can be identified to within 2 other institutions, using an appropriate distance measure, then that is a disclosure risk and must be resolved before releasing the data. If too many disclosure risks are obtained then a common variable(s) may be dropped from the public-use file, or the variable(s) may be coarsened. If there are only a few identified disclosure risks found, the appropriate action is to selectively perturb a set of the common variables until all disclosure risks are resolved. This analysis is repeated sequentially for each list file until it can be applied to each list file without identifying any disclosure risks.

Whenever institution head, teacher, student, or parent data are clustered, a subsampling of respondents is required. Data from respondents selected into this subsample are reviewed using an additional disclosure edit. The edit is either: (1) a blanking and imputing, or data swapping of a sample of sensitive items collected; or (2) a data swapping of the key identification variable of the respondent or institution. The amount of editing is set at a level sufficient to protect the confidentiality of the respondent, while not compromising the analytic usefulness of the data file.

The basic procedures are still the same. NCES has added additional measures as diagnostics to determine which of several trial data perturbations to select to meet the requirement to protect the confidentiality of the respondent, while not compromising the analytic usefulness of the data file.
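The NCES list-matching check can be sketched as follows. This is a reading of the rule under stated assumptions, not NCES's implementation: an institution is flagged when at most two other institutions on the public list are at least as close to its sample record as its own list entry. The function name `disclosure_risks`, the `id` field, and the use of unscaled Euclidean distance are all illustrative; the source says only that "an appropriate distance measure" is used.

```python
import math

def disclosure_risks(sample, public_list, common_vars, max_others=2):
    """Return ids of sampled institutions whose own entry on the public list
    has at most `max_others` other entries that are at least as close.
    Hypothetical helper; assumes every sampled id appears on the list."""
    risky = []
    for inst in sample:
        point = [inst[v] for v in common_vars]

        def distance(entry):
            return math.dist(point, [entry[v] for v in common_vars])

        own_entry = next(e for e in public_list if e["id"] == inst["id"])
        own_distance = distance(own_entry)
        others_as_close = sum(
            1
            for e in public_list
            if e["id"] != inst["id"] and distance(e) <= own_distance
        )
        if others_as_close <= max_others:
            risky.append(inst["id"])
    return risky

# Made-up public list and sample file sharing two variables.
public_list = [
    {"id": "A", "enrollment": 500, "teachers": 30},
    {"id": "B", "enrollment": 505, "teachers": 31},
    {"id": "C", "enrollment": 503, "teachers": 30},
    {"id": "D", "enrollment": 502, "teachers": 30},
    {"id": "E", "enrollment": 2000, "teachers": 120},
]
sample = [
    {"id": "A", "enrollment": 504, "teachers": 30},   # already perturbed
    {"id": "E", "enrollment": 2000, "teachers": 120}, # an outlier, unperturbed
]
flagged = disclosure_risks(sample, public_list, ["enrollment", "teachers"])
```

In this made-up example the outlying institution is flagged and the perturbed one is not, which mirrors the row's remedy of perturbing or coarsening common variables and then repeating the check. In practice the variables would need to be put on comparable scales before computing distances.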
National Center for Health Statistics (NCHS) Yes – Confidentiality Officer and Disclosure Review Board Yes It is NCHS policy to make microdata files available to the scientific community so that additional analyses can be made for the country’s benefit. Such files follow guidance and principles contained in the NCHS Staff Manual on Confidentiality (September, 2004), Section 9 "Avoiding Inadvertent Disclosures Through Release of Microdata," and the NCHS Checklist for the Release of Micro Data Files. These guidelines require that detailed information that could be used to identify individuals (for example, date of birth) should not be included in microdata files. The identities of geographic places and characteristics of areas with fewer than 100,000 people are never to be identified, and it may be necessary to set this minimum at a higher number if research or other considerations so indicate. Information on the drawing of the sample that could identify data subjects should not be included.

The techniques, methods, and guidance used to protect NCHS’s public use microdata have largely remained unchanged since 2004, although there are a few exceptions. The changes detailed below have been in response to changes in technology and proliferation of external data, and were made to reduce disclosure risk to individuals in NCHS data systems.

1. Vital record (birth, death, fetal death and linked birth/infant death) public use microdata files beginning with the 2005 data year contain individual-level vital event data at the national level only. The files for births, deaths, fetal deaths and linked birth/infant death generally include most other items from the vital record with the exception of exact dates.

2. Some NCHS surveys collect information on observable health conditions/limitations or rare conditions. This information is often excluded from public use microdata files because the information, in combination with the extensive information for other characteristics, is considered to pose too great a risk of respondent re-identification by knowledgeable insiders or from media coverage.

3. The level of detail for some variables has been reduced on public use microdata files. This includes geographic information for almost all files, but also includes items such as household relationships, race/ethnic categories and other observable characteristics that could increase risk of identification when combined with other indirect identifying information.

Compared to 2004, NCHS staff responsible for developing public use microdata files spend more time identifying and researching external files available via the Internet to assess whether external sources can be used to re-identify NCHS survey respondents. Advances in computer technology and the introduction of Big Data and Open Data initiatives pose new challenges for preparing public use microdata files that were not present 10 years ago.

National Center for Health Statistics (NCHS) continued Yes – Confidentiality Officer and Disclosure Review Board Yes Refer back to previous row. Although NCHS has reduced the level of detail available on public use microdata files since 2004, we have attempted to balance this by making non-public use microdata files more available through expansion of RDC sites, use of special agreements permitting access under controlled conditions (e.g., Designated Agent Agreements or DUAs), and development of new access tools. a. The NCHS RDC now offers researchers four modes of access to restricted-use NCHS microdata: (1) on-site at the NCHS RDC, (2) on-site at a Census RDC, (3) remote access, and (4) a staff-assisted research option. Additional information about each access mode can be found at the following location: B2AccessMod/ACs200.htm. b. NCHS is developing new tools for data access. For example, NCHS is developing a National Health Interview Survey Online Analytic Real-Time System (OARS) to help meet the need for state-level estimates. This tool will allow health experts, policymakers, journalists, and others to search and compare health statistics by county, region, and state nationwide for grant proposals, needs assessments, research, news reporting, and policymaking. Additional information on OARS can be found at: nchs/data/bsc/nhis_online_analytic_realtime_system.pdf NCHS remains committed to making data as widely available as possible while protecting the confidentiality of respondents. Approximately 95 percent of NCHS collected data are released in public use microdata files and most of the remaining data are available under controlled conditions that meet our legislative mandates to protect respondent identity.
Agency for Healthcare Research and Quality (AHRQ) Yes – Disclosure Review Board Yes The disclosure limitation procedures used by AHRQ are similar to those of NCHS. No updates; AHRQ continues to use procedures similar to NCHS but without the NCHS-specific revisions detailed above.
National Agricultural Statistics Service (NASS) No Yes NA No updates
Economic Research Service (ERS) No Yes NA No updates
Bureau of Economic Analysis (BEA) No Yes NA No updates
Social Security Administration (SSA) Yes - 2 Disclosure Review Boards. One handles Title 13 data; the other does not. Yes When releasing public use microdata files, individual identifiers are removed from all records, and other distinguishing characteristics are modified to prevent identification of persons to whom a record pertains. Records are sequenced in random order to avoid revealing information due to the ordering of records on the file. Top codes and bottom codes are employed for numeric fields to avoid showing extreme field values on a data record. Values beyond the top code or bottom code are replaced by the average of the values in excess of the respective top code or bottom code. Top code and bottom code values are derived at the national level and the replacement values are derived and applied at the state level when appropriate. Values shown for some categorical fields are combined into broader groupings than those present on the internal file, and dollar amounts are rounded. Top code and bottom code values, replacement values, and related information are provided to users as part of the file documentation. Since 2010, the DRB has built a working relationship with the Office of Open Government in part to prevent the mosaic effect. Based on White House Open Government initiatives, SSA has enhanced its procedures for releasing data on the Agency website and onto The National/Homeland Security and Privacy/Confidentiality Checklist and Guidance (referred to as the NHSP Checklist) is part of the guidance from the White House and is to be used by departments and agencies submitting datasets for publication on This Checklist augments the processes SSA is using to meet its existing statutory, regulatory or policy requirements for protecting national/homeland security and privacy/confidentiality interests. Since 2012, the DRB has included an external voting board member from the U.S. Census Bureau. This provides an avenue for the DRB to ensure that agency staff are informed of the latest disclosure avoidance techniques utilized and recommended by the Bureau's DRB.
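The SSA row combines a nationally derived top code with replacement values derived at the state level "when appropriate", plus random record ordering. The sketch below illustrates that combination under assumptions of its own: the function name `top_code_by_state`, the field names, and the rule that falls back to the national mean when a state has fewer than `min_count` top-coded records are all hypothetical, since the source does not say when state-level replacement is considered appropriate.

```python
import random
from collections import defaultdict

def top_code_by_state(records, field, state_field, national_top, min_count=2):
    """Replace values above a national top code with the mean of the
    top-coded values in the same state. States with fewer than `min_count`
    such values use the national mean instead (an assumed fallback).
    Hypothetical helper for illustration only."""
    excess = defaultdict(list)
    for r in records:
        if r[field] > national_top:
            excess[r[state_field]].append(r[field])
    all_excess = [v for vals in excess.values() for v in vals]
    national_mean = sum(all_excess) / len(all_excess) if all_excess else None
    replacement = {
        state: (sum(vals) / len(vals) if len(vals) >= min_count else national_mean)
        for state, vals in excess.items()
    }
    released = []
    for r in records:
        new = dict(r)
        if r[field] > national_top:
            new[field] = replacement[r[state_field]]
        released.append(new)
    return released

# Made-up records; the ordering is randomized after top-coding.
rows = [
    {"state": "A", "benefit": 50.0},
    {"state": "A", "benefit": 300.0},
    {"state": "A", "benefit": 500.0},
    {"state": "B", "benefit": 80.0},
    {"state": "B", "benefit": 1000.0},
]
released = top_code_by_state(rows, "benefit", "state", national_top=200.0)
random.Random(7).shuffle(released)
```

The fallback matters because a state-level mean computed from a single record would simply reproduce that record's true value.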
Internal Revenue Service (IRS) Yes - Legislatively Controlled No SOI produces one annual public-use microdata file, known as the SOI “tax model”, containing a sample of data based on the Form 1040 series of individual tax returns. The disclosure protection procedures applied to this file include: (1) subsampling certainty records at a 33 percent rate; (2) removing certain records having extreme values; (3) suppressing certain fields from all records and geographical fields from high income records; (4) top coding and modifying some fields; (5) blurring some fields of high income records by locally averaging across records; and (6) rounding amount fields to four significant digits. To help ensure that taxpayer privacy is protected in the SOI tax model file, SOI has periodically contracted with experts who employ “professional intruder” techniques to both verify that confidentiality is protected and to inform the techniques to be applied to future releases of the SOI tax model file. SOI reviews its statistical disclosure limitation procedures for its public use microdata file and introduces enhancements on an ongoing basis. For example, the maximum sampling rate was changed to 10 percent several years ago, and multivariate blurring replaced univariate blurring for key fields on high-income returns. SOI is currently redesigning its public use file.
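Two of the SOI procedures listed above, rounding to four significant digits and blurring by local averaging, lend themselves to a short illustration. The sketch is a univariate simplification under assumed details: the function names `round_sig` and `blur_local_average`, the group size of three, and the rule for folding a short final group into its neighbour are not taken from SOI documentation.

```python
import math

def round_sig(x, digits=4):
    """Round x to `digits` significant digits (Python's round-half-even applies)."""
    if x == 0:
        return 0
    magnitude = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - magnitude)

def blur_local_average(values, group_size=3):
    """Sort the values, split them into consecutive groups, and replace each
    value with its group's mean. A short final group is merged into the
    previous one so no value is left on its own.
    Hypothetical helper for illustration only."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + group_size] for i in range(0, len(order), group_size)]
    if len(groups) > 1 and len(groups[-1]) < group_size:
        groups[-2].extend(groups.pop())
    blurred = list(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            blurred[i] = mean
    return blurred

# Made-up amounts: blur first, then round the released figures.
amounts = [10, 1000, 20, 990, 30, 1010, 5000]
released = [round_sig(v) for v in blur_local_average(amounts)]
```

Averaging within groups of similar values keeps the field total intact while ensuring that no released amount corresponds to a single return.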
Bureau of Transportation Statistics (BTS) Yes – Disclosure Review Board No The BTS Confidentiality Procedures Manual documents the confidentiality procedures for the agency. For most microdata and tabular data products, BTS program managers are required to complete a checklist identifying potential disclosure risks and outline any steps taken to mitigate such risks. The BTS’s DRB reviews the data product and checklist and makes a final determination on disclosure risk. The DRB can recommend application of SDL methods prior to public dissemination. BTS uses various microdata SDL methods based on the disclosure review findings and the unique characteristics of the data files. Some SDL procedures used include data suppression and modification. Data modification includes recoding continuous variables into categorical variables, collapsing categories, top and bottom coding, introduction of noise, and data swapping. BTS program managers must also identify any external data that could be matched to BTS datasets and take steps to minimize the ability to match. No updates; 2005 description remains accurate.
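Recoding a continuous variable into categories and collapsing sparse categories, both listed in the BTS row, can be sketched in a few lines. The function names, cut points, labels, and the minimum count of two are illustrative assumptions rather than BTS rules.

```python
from bisect import bisect_right
from collections import Counter

def recode_to_bands(value, cut_points, labels):
    """Map a continuous value to a band label. `cut_points` must be sorted,
    and `labels` needs one more entry than `cut_points`."""
    return labels[bisect_right(cut_points, value)]

def collapse_sparse(categories, min_count, other="Other"):
    """Fold every category observed fewer than `min_count` times into `other`."""
    counts = Counter(categories)
    return [c if counts[c] >= min_count else other for c in categories]

# Made-up traveler ages and trip modes.
age_labels = ["under 18", "18-64", "65+"]
ages = [17, 18, 42, 70]
age_bands = [recode_to_bands(a, [18, 65], age_labels) for a in ages]

modes = ["car", "car", "bus", "car", "ferry"]
collapsed = collapse_sparse(modes, min_count=2)
```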
Bureau of Justice Statistics (BJS) Yes - legislatively controlled agency review No The same requirements under Title 13 of the U.S.C. that cover the Census Bureau are followed by BJS for those data collected for BJS by the Census Bureau. Standards for microdata protection are incorporated in BJS enabling legislation. Individual identifiers are routinely stripped from all microdata files before they are released for public use. BJS has allowed access to restricted files since at least 2000. Direct identifiers are routinely removed from all microdata files prior to release. Indirect identifiers (for example, geographic identifiers, dates of unique events, or age) undergo disclosure avoidance measures commensurate with the level of release (public, restricted, or enclaved). Measures commonly used include categorization of continuous variables, top- or bottom-coding, rounding, addition of noise, and data swapping. Most restricted microdata files are available by application from the National Archive of Criminal Justice Data (NACJD). The Archive is in the process of implementing and expanding technology that allows remote access to restricted microdata files. With this technology, the user does not receive or download the microdata, but rather logs into and analyzes the data on a secure NACJD server. In 2011, BJS began making extremely sensitive microdata files available onsite (also by application) at the Inter-university Consortium for Political and Social Research (ICPSR) Data Enclave at the University of Michigan in Ann Arbor, MI.


