Minimizing Disclosure Risk in HHS Open Data Initiatives. Session 4: Good Practices for Protecting Public Use Data


Connie Citro, Committee on National Statistics (Moderator)

Mark Asiala, Census Bureau

Barry Johnson, Statistics of Income, Internal Revenue Service

Allison Oelschlaeger, Centers for Medicare & Medicaid Services

Eve Powell-Griner, National Center for Health Statistics

Fritz Scheuren, NORC at the University of Chicago


Connie Citro:

People voluntarily offer information to health researchers (for example, participants in a clinical trial) with the understanding that the researchers will keep the information confidential. A breach would be a violation of this trust. What are the implications if this data is made public? How much thinking have agencies done regarding the probability of disclosure? What are the likely risks versus the potential risks? It is probably time to update Statistical Policy Working Paper 22.

Mark Asiala:

Public use files that include microdata are only one part of a “suite” of data types released by the Census Bureau. Other types include tables produced from aggregated data for low levels of geography, special tabulations, and research papers.

Potential threats include an ability to identify individuals by using the tables directly, matching external data to public use files, or using data products in combination.

Strategies for protecting data from disclosure vary with the type of data. For tables, the table design and combinations of data swapping and partially synthetic variables on the source files are used to reduce disclosure risk. For public use files, size thresholds for geographic and category detail; noise addition for some variables; and additional data swapping and/or partial synthesis of data are used. Rounding is the primary strategy in special tabulations and research papers. We want to minimize the use of suppression techniques because they harm the utility of the data. We would prefer to mask a particular characteristic rather than an entire record.

For tables, the granularity of data cells raises the risk of re-identification. Too much detail leads to a “pseudo-microdata” file. A good rule of thumb is to not publish tables with more than 100 cells. Skewed distributions are another concern, even for less detailed tables and/or larger geographies. Treating the records at risk before producing tabulations is preferable to having to suppress cells.

Strategies used for public use files include subsampling, setting thresholds for identification of geographic areas and categories, additional data swapping for “special uniques,” noise infusion, and synthetic data. The threshold for identification of geographic areas is 100,000 (population size). The threshold for categories is 10,000 nationally. A special unique case will stand out even with a large sample size, so additional swapping is done for such cases. Noise infusion is used for age, with some constraints.
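Bounded noise infusion for age, as mentioned above, can be sketched as follows. This is only an illustration; the shift bound and age range here are hypothetical parameters, not the Census Bureau's actual constraints:

```python
import random

def infuse_age_noise(age, max_shift=3, min_age=0, max_age=115):
    """Add bounded random noise to an age, then clamp to a plausible range.

    max_shift, min_age, and max_age are illustrative values, not the
    Census Bureau's actual parameters.
    """
    noisy = age + random.randint(-max_shift, max_shift)
    return max(min_age, min(max_age, noisy))

ages = [34, 0, 115, 67]
protected = [infuse_age_noise(a) for a in ages]
```

Because the noise is bounded and the result is clamped, each protected age stays within a few years of the original, preserving most analytic utility while breaking exact matches on age.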

For special tabulations and research papers, we evaluate the detail of the tables and the underlying data to avoid inadvertent disclosure. We round the data to protect small cells and coarsen the detail. In some cases we impose a hard rule, such as publishing no detail for a given characteristic below the county or state level.
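Rounding counts to a common base, as described above, might look like the following sketch (the base of 5 and the cell names are hypothetical choices for illustration):

```python
def round_count(n, base=5):
    """Round a cell count to the nearest multiple of `base`.

    base=5 is an illustrative choice; an agency would pick the base to
    match the sensitivity of the table.
    """
    return base * int((n + base / 2) // base)

table = {"cell_a": 3, "cell_b": 12, "cell_c": 0}
rounded = {k: round_count(v) for k, v in table.items()}
```

Rounding every cell, rather than only the small ones, avoids flagging exactly which cells were at risk.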

A group has been meeting to anticipate new disclosure risks. The Census Bureau is working on a microdata analysis system that allows tabulations off the entire data file, but with certain restrictions and protections, as an alternative to public use files. We are also asking if we can create a bridge between public use files and RDCs. Is there a middle ground?

Citro: We need to first understand the probabilities of data disclosure and what the actual effects of a disclosure may be. Why develop hypothetical scenarios where a neighbor knows so much about a sample member that they could pick that person out from a public use file?

Barry Johnson:

I represent the statistics arm of the IRS. We have data from tax returns and other documents filed with the IRS (we do not have survey data). We produce tabulations and analyses for the general public, and a couple of public use files. The individual tax public use file has existed since 1962. Our public use data has been the core of tax and economic modeling for the Congressional Budget Office, the Urban Institute, and the National Bureau of Economic Research.

The IRS works with the Federal Reserve Board to plan disclosure protection of the data collected in the Survey of Consumer Finances. Tax data is releasable in part because the 1040 form contains few demographic items, which makes an intruder's job more difficult.

Data is constrained by accounting rules, so it is difficult to perturb. Because of the alternative minimum tax rules and other complexities, it is important to preserve these relationships in the data. Non-linear relationships also make the data hard to adjust. We remove obvious identifiers and rely on a portfolio of techniques to protect especially vulnerable variables.
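One way to perturb such data while preserving an accounting identity is to add noise to the components and then recompute the total from the perturbed values. The sketch below is purely hypothetical (the field names and the 2 percent noise level are invented), not the IRS's actual method:

```python
import random

def perturb_return(record, rel_noise=0.02):
    """Apply small multiplicative noise to each income component, then
    recompute the total so the accounting identity still holds exactly.

    Field names and the noise level are illustrative assumptions.
    """
    components = ["wages", "interest", "dividends"]
    out = {k: record[k] * (1 + random.uniform(-rel_noise, rel_noise))
           for k in components}
    out["total_income"] = sum(out[k] for k in components)
    return out

original = {"wages": 52000.0, "interest": 800.0, "dividends": 1200.0,
            "total_income": 54000.0}
perturbed = perturb_return(original)
```

Recomputing the total after perturbation keeps the components and the total mutually consistent, which matters when downstream users re-run tax calculations on the file.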

We have partnered with experts in disclosure limitation and with fellow agencies to protect data and variables. Based on these reviews, the individual tax public use file is updated regularly. We then evaluate how effective the changes have been. Having access to the full population dataset makes evaluation or simulation effective (you can match the public use file to the population data to assess risk).

Allison Oelschlaeger:

CMS has mostly administrative data and relatively little voluntary or survey data. Our office was formed a few years ago to maximize the usefulness of data for internal and external users. CMS policy favors aggregated data; we do not publish cells with counts of 10 or fewer.
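The cell-size rule just described (no cells of 10 or fewer) can be sketched as a simple filter; the table contents here are hypothetical:

```python
def suppress_small_cells(table, min_count=11):
    """Replace counts below min_count with None to mark them suppressed.

    min_count=11 implements a 'no cells of 10 or fewer' rule; the table
    below is a made-up example.
    """
    return {k: (v if v >= min_count else None) for k, v in table.items()}

counts = {"county_a": 250, "county_b": 7, "county_c": 11}
published = suppress_small_cells(counts)
```

In practice, primary suppression like this usually requires complementary suppression as well, so that a suppressed cell cannot be recovered from row or column totals.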

CMS produces two types of de-identified data products, and it was a struggle to create useful de-identified products. (1) Stand-alone public use files of basic Medicare claims data—we remove direct identifiers and look carefully at indirect identifiers. (2) A synthetic file. This is a good way for researchers to develop expertise before doing research with the actual data.

Regarding researcher access to files with identifiable information: historically this data is encrypted and shipped on hard drives under a data use agreement (DUA). Last year we launched a "virtual RDC": researchers submit a research protocol, conduct the research, and CMS reviews and clears any outputs. This way, researchers do not have to satisfy security requirements at their own facilities.

Eve Powell-Griner:

Most of NCHS’s data is survey data, but we also have vital statistics records and physical exam data. We rely on Working Paper 22 and standard limitation techniques. What is different: we think about disclosure limitation from the get-go. We identify potential problems with data each year and discuss them with our review board.

We have been developing on-line resources during the past few years. About 95 percent of our data is released as public use files. For the remaining 5 percent, we require a DUA and designated-agent agreements (for other federal agencies and researchers supported by organizations with security protections), and use of the data in RDCs, which can be used either in person or remotely.

NCHS is becoming somewhat more conservative about what it releases in public use files (for example, geography fields). There is a trade-off between accessibility and control. All of our data is accessible in some form; only personally identifiable information is off limits.

We focus on rare characteristics that would be identifiable, and we are sensitive to rare information fields. NCHS has deployed new software to extend risk assessment and assign a probability of disclosure. In addition, we need to keep the genetic data collected in the National Health and Nutrition Examination Survey under tight control.

Fritz Scheuren:

Advantage: the variety of disclosure prevention techniques available. Disadvantage: the extent of the variety available.

I agree that it is time to update Statistical Policy Working Paper 22; it should be updated every five years (at least that frequently).

There is a "civil war" going on: the data quality people are at war with the information quality people. Users can't rely on tables alone; they want to use the data in a microdata simulation model.

To date we have only done a level-one fix, which is not really enough. We haven't gone beyond this, due to resource constraints. We are not keeping up with the predator-prey problem. Whatever we do to protect the data will eventually be defeated. And there need to be penalties for intruders. It is important to note that with public use files, there is no contract, no DUA.


Citro: Mosaic effect is a term like “big data.” The more multivariate you are, the greater the risk of disclosure. We need to be careful not to be too restrictive for public use files, especially for variables that hold little re-identification value, as some of these variables hold great research value.

Johnson: Revealing that a person filed a tax return is considered a disclosure by the IRS; we set a high bar to prevent disclosure. So how do we balance transparency and confidentiality? By working in cooperation with the users. We formed a user group, and we ask them to help us make choices. Two outside users helped develop the updated version of the public use file, which increased utility and strengthened protection. Their participation helped justify removing the geographic variable; “there is no zealot like a convert.” Users worked through the process to make a better balance among the trade-offs.

Oelschlaeger: Recently, CMS has focused on aggregated files rather than de-identified files. The stand-alone de-identified public use files are so focused on removing all variables that could lead to re-identification that the files are useless. My boss calls them public useless files. Earlier this year CMS released aggregated data at the physician level. Historically, a 1979 injunction prevented Medicare physician payment disclosure. Dow Jones filed suit to overturn this injunction, and a judge agreed. After the injunction was overturned, CMS had to determine what to release at the physician level, and we have since published a file with National Provider Identifier-specific data. We give more weight to beneficiary privacy than to physician privacy in this data release.

Asiala: There is transparency within the agency as well as outside of it. We need a better solution for transparency plus protection. It is becoming more important that subject-matter experts work with the statistical experts to improve the effectiveness and reasonableness of de-identification practices.

Powell-Griner: Federal statistical agencies are trying to be more responsive to users. We get feedback from data users who report what data fields they want, and we try to make appropriate trade-offs. Maybe they will get access, but not as conveniently as perhaps they would like.

Scanlon: CMS—what if a data aggregator asks for data? Do you only allow it for research?

Oelschlaeger: CMS has a number of ways to share data, only one of which is for researchers. HIPAA has a concept of Limited Datasets (this is what commercial researchers get). The Qualified Entity program in the Affordable Care Act gives CMS the authority to release data for quality improvement/performance measures, provided the entity that is receiving the CMS data is combining it with other payer data. Any disclosure of identifiable data requires a DUA.

Scheuren: The mosaic effect comes into play when someone extracts data and tries to match it to another dataset. Billions of records in the insurance world are used for data mining. This is largely a good thing, but there are downsides (such as hackers). We have a trust system, but you need a trust-but-verify system. There are three things to do: penalize intruders, enforce the rules, and scale back overzealous confidentiality.

The quality of the upper tail of the CPS income distribution is so poor that disclosure protection there isn't really necessary; applying it wastes resources.

Johnson: IRS needs a legislative change to allow a DUA that would put the responsibility on users. Right now, our data must be completely safe, or it cannot be made available.

Turek: How do you find hackers so you can punish them?

Barth-Jones: You could use computer forensics, ensure that re-identification has legal consequences (and encourage more whistleblowing), and impose penalties for false positives.

The postal code presents big problems—it is too detailed.

Citro: Communities need to be involved in the process, as they may help to identify variables that pose a high re-identification risk.

El Emam: DUAs only solve the issue of local adversaries. Children are the most easily re-identifiable because they reveal a lot of personal information. But their parents also reveal a lot of information about them (through social media).

Love: The Washington State problem is a good, real-world example of why multiple layers of data privacy management are necessary.

El Emam: I have never seen any evidence that predictive analytic firms are trying to re-identify individuals.
