Mark Asiala explained that public use files that include microdata are only one part of a “suite” of data types released by the Census Bureau. Other types include tables produced from aggregated data for low levels of geography, special tabulations, and research papers.
The potential threats the Census Bureau faces include intruders identifying individuals directly from published tables, matching external data to public use files, or combining multiple data products.
The Bureau’s strategies for protecting data from disclosure vary with the type of data. To reduce disclosure risk in tables, they alter the table design and apply combinations of data swapping and partially synthetic variables to the source files. For public use files, they apply size thresholds for geographic and category detail, noise addition for some variables, and additional data swapping and/or partial synthesis of data. Rounding is the primary strategy for special tabulations and research papers. The Bureau prefers to minimize the use of suppression techniques because they harm the utility of the data, and Asiala recommended that data holders consider whether they can mask a particular characteristic rather than an entire record.
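To make the data-swapping idea concrete, the following is a minimal sketch, not the Bureau's actual procedure: a sensitive attribute is exchanged between pairs of records that agree on some key characteristics, so marginal totals are preserved while record-level linkage is disrupted. All function and field names here are illustrative.

```python
import random

def swap_attribute(records, key_fields, swap_field, swap_rate=0.05, seed=0):
    """Toy data swap: exchange `swap_field` between pairs of records that
    agree on `key_fields`, for a random fraction of pairs. Illustrative
    only; production swapping procedures are far more sophisticated."""
    rng = random.Random(seed)
    # Group record indices by their key-field values.
    groups = {}
    for i, rec in enumerate(records):
        key = tuple(rec[f] for f in key_fields)
        groups.setdefault(key, []).append(i)
    swapped = [dict(rec) for rec in records]
    for idxs in groups.values():
        rng.shuffle(idxs)
        # Pair off records within each group and swap with probability swap_rate.
        for a, b in zip(idxs[::2], idxs[1::2]):
            if rng.random() < swap_rate:
                swapped[a][swap_field], swapped[b][swap_field] = (
                    swapped[b][swap_field], swapped[a][swap_field])
    return swapped
```

Because swaps occur only within groups sharing the key fields, any tabulation by those fields is unchanged; only the pairing of the swapped attribute with the rest of each record is perturbed.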
For tables, the granularity of data cells raises the risk of re-identification—too much detail leads to a “pseudo-microdata” file. A good rule of thumb is not to publish tables with more than 100 cells. Treating the records at risk before producing tabulations is preferable to having to suppress cells.
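The 100-cell rule of thumb can be checked mechanically before a table is designed: the number of cells in a full cross-tabulation is simply the product of the category counts of its dimensions. A small sketch (the categories used are hypothetical examples):

```python
from math import prod

def table_cell_count(dim_categories):
    """Cells in a full cross-tabulation = product of the number of
    categories in each dimension."""
    return prod(len(cats) for cats in dim_categories)

def exceeds_rule_of_thumb(dim_categories, max_cells=100):
    """Flag a proposed table that exceeds the ~100-cell guideline."""
    return table_cell_count(dim_categories) > max_cells

# Example: 5 age groups x 2 sexes x 12 income brackets = 120 cells,
# which exceeds the rule-of-thumb limit.
```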
Strategies used for public use files include sub-sampling; thresholds for the identification of geographic areas and categories; additional data swapping for “special uniques”; noise infusion; and synthetic data. The threshold for identifying a geographic area is a population of 100,000, and the threshold for a category is 10,000 nationally. A “special unique” case will stand out even in a large sample, so additional swapping is done for such cases.
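Noise infusion can take many forms; a minimal sketch, assuming simple multiplicative noise (one common variant, not necessarily the Bureau's), multiplies each value by a random factor centered at 1 so that aggregates remain approximately unbiased:

```python
import random

def add_multiplicative_noise(values, rel_sd=0.05, seed=0):
    """Toy noise infusion: multiply each value by a factor drawn from a
    normal distribution with mean 1 and standard deviation `rel_sd`.
    Individual values change, but large aggregates stay close to their
    true totals. Illustrative only."""
    rng = random.Random(seed)
    return [v * rng.gauss(1.0, rel_sd) for v in values]
```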
For special tabulations and research papers, the Bureau rounds the data to protect small cells and coarsens the detail. In some cases they impose a hard rule, such as publishing no detail of a given characteristic below the county or state level.
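Rounding cell counts to a fixed base protects small cells because a count below half the base rounds to zero and nearby counts become indistinguishable. A minimal sketch, assuming rounding to base 5 (the base is an assumption for illustration, not a stated Bureau rule):

```python
def round_to_base(count, base=5):
    """Round a non-negative cell count to the nearest multiple of `base`
    (ties round up). Small counts below base/2 round to zero."""
    return base * ((count + base // 2) // base)

def protect_cells(counts, base=5):
    """Apply base rounding to every cell in a table."""
    return [round_to_base(c, base) for c in counts]
```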
The Census Bureau is working on a microdata analysis system that allows tabulations from the entire data file, subject to certain restrictions and protections, as an alternative to public use files. They are also considering creating a bridge between public use files and research data centers to find a middle ground between these approaches.
Barry Johnson discussed the role of the statistics arm of the IRS, which has data from tax returns but no survey data. IRS public use data has been the core of tax and economic modeling for the Congressional Budget Office, the Urban Institute, and the National Bureau of Economic Research. The IRS works with the Federal Reserve Board to plan disclosure protection of the data collected in the Survey of Consumer Finances.
Tax data can be released in part because the 1040 form contains few demographic variables, which makes an intruder’s job more difficult. The data are constrained by accounting rules, so they are difficult to perturb; because of the alternative minimum tax rules and other complexities, it is important to preserve these accounting relationships in the data. The IRS removes obvious identifiers and relies on a portfolio of techniques to protect especially vulnerable records and variables.
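The constraint that perturbation must preserve accounting relationships can be illustrated with a sketch: perturbing income components independently and then recomputing any derived total keeps an identity such as total = sum of components intact, whereas perturbing the total separately would break it. The field names below are hypothetical, not actual IRS variables.

```python
import random

def perturb_preserving_totals(record, component_fields, total_field,
                              rel_sd=0.02, seed=0):
    """Toy sketch: add multiplicative noise to income components, then
    recompute the derived total so the accounting identity
    total == sum(components) still holds exactly. Illustrative only."""
    rng = random.Random(seed)
    out = dict(record)
    for f in component_fields:
        out[f] = out[f] * rng.gauss(1.0, rel_sd)
    # Recompute the total from the perturbed components rather than
    # perturbing it independently, so the identity is preserved.
    out[total_field] = sum(out[f] for f in component_fields)
    return out
```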
The IRS works with experts in disclosure limitation and with fellow agencies to protect data and variables; based on these reviews, it updates the individual tax public use file regularly and evaluates how effective the changes have been. Having access to the full population dataset makes evaluation and simulation effective, because the public use file can be matched against the population data to assess risk.
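The matching-based risk assessment described above can be sketched as follows: count, for each public use record, how many population records share its combination of quasi-identifiers; a record whose combination is unique in the population is a re-identification candidate. This is a generic population-uniqueness check, not the IRS's actual methodology, and the field names are assumptions.

```python
from collections import Counter

def reidentification_risk(puf, population, quasi_ids):
    """Toy risk check: a PUF record is 'at risk' if its quasi-identifier
    combination matches exactly one record in the full population file.
    Returns (number at risk, fraction of the PUF at risk)."""
    pop_counts = Counter(tuple(r[f] for f in quasi_ids) for r in population)
    at_risk = [r for r in puf
               if pop_counts.get(tuple(r[f] for f in quasi_ids)) == 1]
    return len(at_risk), len(at_risk) / len(puf)
```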
Allison Oelschlaeger commented that CMS has mostly administrative data. CMS formed the Office of Information Products and Data Analytics a few years ago to maximize the usefulness of its data for internal and external users. CMS produces two types of de-identified data products: (1) stand-alone public use files of basic Medicare claims data, with direct identifiers removed and careful review of indirect identifiers; and (2) a synthetic file, which is a good way for researchers to develop expertise before doing research with the actual data. CMS has also launched a “virtual RDC”: researchers submit a research protocol, conduct the research remotely, and any outputs are reviewed and cleared by CMS; this way, researchers do not have to satisfy security requirements at their own facilities.
Eve Powell-Griner pointed out that most of NCHS’s data is survey data, but they also offer vital statistics records and physical exam data. NCHS relies on Statistical Policy Working Paper 22 and standard disclosure limitation techniques; in addition, it considers disclosure limitation from the beginning of the process and discusses any data issues with its review board.
NCHS is becoming somewhat more conservative in what it releases in public use files (for example, geography fields). All NCHS data is accessible except for personally identifiable information. They focus on rare characteristics that would be identifiable and are sensitive to rare information fields. NCHS has deployed new software to extend risk assessment and assign a probability of disclosure, and a priority is keeping the genetic data collected in the National Health and Nutrition Examination Survey under tight control.
Fritz Scheuren stated that the variety of disclosure prevention techniques available is an advantage for data holders, but the sheer extent of that variety is also a disadvantage. He said there is a “civil war” going on between the data quality people and the information quality people. Users cannot rely on tables alone; they want to use the data in microdata simulation models. Regarding disclosure prevention, he is concerned that data holders are not keeping up with the predator-prey problem: whatever federal agencies do to protect the data will eventually be defeated.