Jacob Bournazian, Energy Information Administration
John Czajka, Mathematica
Much of this is what I wrote in Statistical Policy Working Paper 22, chapters 3 to 5.
Microdata are highly sensitive to identity disclosure if:
- Data are from administrative records. To protect such data:
  - Categorize so there is no unique combination of variables, or
  - Add some type of disturbance to the data
- The microdata contain a “population unique.” To assess this risk:
  - Relate the distribution of sample uniques to the distribution of population uniques
- The file is linkable to administrative record files or other exogenous data files
The bottom line is this. A file is adequately protected if the disturbed microdata cannot be successfully matched to the original data file or to another file with comparable variables.
EIA regularly combs through social media and newspapers to identify sources of microdata disclosure risk. All government agencies are affected by the impact of “sensational events.” Catastrophic events and sensational news stories create re-identification risks that can exceed the capabilities of the methodologies applied to a file. For example, in April 2010 there was a mine explosion in West Virginia. Workers’ compensation files were eventually released, and although they were run through software that anonymizes microdata, it was easy to figure out who the 29 deceased miners were in the data.
The first step in protecting microdata files is to remove direct identifiers. Direct identifiers are usually personally identifiable information (PII). Examples include:
- Names and addresses
- Small geographic subdivisions
- Telephone or fax numbers
- E-mail addresses
- Social Security numbers
- Medical record numbers
- Health plan numbers
- Patient record numbers
- Account numbers
- Certificate or license numbers
- Vehicle identifiers and license plate numbers
- Medical device and serial numbers
- Personal URLs or websites
- Internet Protocol address numbers
- Biometric identifiers (finger and voice prints)
- Full face photographic images
- Any other unique identifying number, characteristic, or code
The second step is to assess and modify indirect identifying information. See pages 12 to 15 of the Mathematica background paper that was provided to the attendees for a summary of techniques to avoid disclosure. Three broad types of microdata disclosure limitation methods for protecting data are: (1) data reduction, (2) data modification, and (3) data creation.
Considering the overall file, three data reduction options are:
- Do not release microdata (that is, only release tabular data)
- Only release a sample of a data file. A census is more likely to disclose a respondent than a sample of respondents
- Only release a selection of variables. Remove sensitive variables (especially direct identifiers and certain indirect identifiers)
Considering the respondent record, options for data reduction include:
- Delete highly unique respondents from the file
- Identify outliers or sensitive values within a respondent’s record and set them to missing (local suppression)
Considering individual data fields or variables, choices include:
- Truncate variable distributions—that is, top or bottom code
- Do not rely on top coding or bottom coding a fixed percentage of records, however. Check the frequency distribution of each variable identified for top or bottom coding; because coding policies are often built around quantiles such as quartiles, applying a fixed percentage can raise issues.
- Recode threshold values with a measure of central tendency—for example, the mean or median
- Recode or collapse a number of categories in a variable distribution
- Round variable values
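The field-level reductions above can be sketched in a few lines. This is a minimal illustration on simulated data; the income variable, the 99th-percentile cap, and the rounding unit are all assumptions chosen for the example, not prescribed thresholds.

```python
import numpy as np

# Hypothetical income variable; names and thresholds are illustrative only.
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.5, sigma=0.8, size=1000)

# Top code: cap values above the 99th percentile at that percentile.
cap = np.percentile(income, 99)
top_coded = np.minimum(income, cap)

# Recode threshold values with a measure of central tendency:
# replace every capped value with the mean of all values above the cap.
recoded = income.copy()
recoded[income > cap] = income[income > cap].mean()

# Round to the nearest 1,000 to coarsen the reported values.
rounded = np.round(top_coded, -3)
```

Checking the frequency distribution before choosing the cap, as recommended above, would guard against a fixed percentile landing in an awkward part of the distribution.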
Data modification encompasses a number of techniques. Perturbation, or noise addition, involves adding “error” to a particular sensitive variable or subset of respondents. In general, the added error is randomly assigned and should have a mean of zero and a known variance. It is useful to check the percentage of records where the added noise falls above or below a threshold level.
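A minimal sketch of zero-mean noise addition follows; the noise scale (5 percent of the variable’s standard deviation) and the two-sigma threshold are illustrative assumptions, not recommended settings.

```python
import numpy as np

# Simulated sensitive variable; all parameters here are illustrative.
rng = np.random.default_rng(42)
values = rng.lognormal(mean=10.0, sigma=0.5, size=5000)

scale = 0.05 * values.std()                  # noise standard deviation
noise = rng.normal(loc=0.0, scale=scale, size=values.size)  # mean zero, known variance
perturbed = values + noise

# Check the share of records whose added noise exceeds a threshold.
threshold = 2 * scale
share_large = float(np.mean(np.abs(noise) > threshold))
```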
Data swapping and data shuffling—a similar technique—are commonly used with categorical variables, but they can also be used with continuous variables. One approach is to first sort the values of continuous variables; values close in rank are designated as pairs and are swapped between the two records. With an alternative approach, some percentage of records is matched with other records in the same file on a set of predetermined categorical variables, and the values of the variables are swapped between the two records. In either case, pay attention to the frequency distribution in deciding how many values to swap. This is more important than the percentage of data swapped.
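The first approach described above—sorting a continuous variable and swapping values between rank neighbors—can be sketched as follows. The pairing rule (adjacent ranks) is the simplest possible choice; real implementations would typically randomize within a rank window.

```python
import numpy as np

def rank_swap(values):
    """Swap values between records that are adjacent in rank order."""
    order = np.argsort(values)              # record indices sorted by value
    swapped = values.copy()
    for i in range(0, len(order) - 1, 2):   # pair up rank neighbors
        a, b = order[i], order[i + 1]
        swapped[a], swapped[b] = swapped[b], swapped[a]
    return swapped

rng = np.random.default_rng(1)
x = rng.normal(50.0, 10.0, size=100)
x_swapped = rank_swap(x)
```

Because values only move between records, the marginal distribution of the variable is preserved exactly, while the association between the variable and each record’s other fields is disturbed.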
Data creation implies replacing the actual data with imputed (or synthetic) data. This is done by first constructing an imputation model, then running the model using the original data, then creating new distributions of the variables that you are trying to protect. Only the created variable is released. The quality of the inferences from synthetic data is only as good as the specificity of the imputation model and the validity of the data used. There are no really good measures to assess the quality of the imputation of synthetic data. One can apply this approach to missing values, select variables, or the entire file.
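The model-then-draw sequence can be illustrated with a deliberately simple case. Everything below is an assumption for the sketch: the variables, the linear imputation model, and the simulated data; a real synthesis model would be far richer.

```python
import numpy as np

# Sketch of model-based synthesis on simulated data.
rng = np.random.default_rng(7)
n = 2000
age = rng.uniform(18, 80, n)                     # released as-is
income = 500 * age + rng.normal(0, 5000, n)      # the variable to protect

# Step 1: fit an imputation model (income ~ age) on the original data.
X = np.column_stack([np.ones(n), age])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
resid_sd = (income - X @ beta).std()

# Step 2: replace actual incomes with draws from the fitted model;
# only the synthetic variable would be released.
synthetic_income = X @ beta + rng.normal(0, resid_sd, n)
```

Inferences involving relationships the model captures (here, income versus age) survive; relationships omitted from the model are lost, which is the point made above about the specificity of the imputation model.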
Federal statistical agencies do a good job of securing their data files. A critical question is whether your microdata file can be matched to an external database. Link Plus software is available from the Centers for Disease Control and Prevention (CDC) at: www.cdc.gov/cancer/npcr/tools/registryplus/lp_tech_info.htm.
Record linkage using non-unique identifiers can be rule-based, distance-based, or involve the use of string comparators. Examples of potential matching variables include gender, month, year of an event or treatment, ZIP code, education, or medical condition. As an aside, while it is true that because of increased computational power, it may not be necessary to block on particular variables (that is, restrict potential matches to records with common values on key fields), blocking remains important as a way of minimizing false positives. The priority is to modify blocking rather than to eliminate it.
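Blocking combined with a rule-based match can be sketched with toy records. The field names and records below are invented for illustration, not a real linkage schema.

```python
from collections import defaultdict

# Two toy files with made-up records and quasi-identifiers.
file_a = [
    {"id": 1, "zip": "20500", "sex": "F", "birth_year": 1980},
    {"id": 2, "zip": "20500", "sex": "M", "birth_year": 1975},
    {"id": 3, "zip": "10001", "sex": "F", "birth_year": 1990},
]
file_b = [
    {"id": "x", "zip": "20500", "sex": "F", "birth_year": 1980},
    {"id": "y", "zip": "10001", "sex": "M", "birth_year": 1990},
]

# Block on ZIP code: only records sharing a ZIP are ever compared,
# which limits false positives as well as computation.
blocks = defaultdict(list)
for rec in file_a:
    blocks[rec["zip"]].append(rec)

# A simple rule-based match within blocks: sex and birth year must agree.
matches = [
    (a["id"], b["id"])
    for b in file_b
    for a in blocks[b["zip"]]
    if a["sex"] == b["sex"] and a["birth_year"] == b["birth_year"]
]
```

With blocking, only three candidate pairs are compared instead of all six; modifying the blocking variables in a released file, as discussed below, would disrupt exactly this step.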
If records on the file can be matched to an external database, then the file should be modified. That is, the variables that contribute to the match would need to be deleted or changed in order to prevent matches or disrupt the blocking strategy.
Examples of popular administrative data sources for linking include health billing records, Social Security records, cancer registries, voter registration records, birth and death records, real property tax records, and health insurer claims data.
Not all agencies follow the same protocols and checks, but a best practice for checking data quality after the application of protection is to conduct statistical analysis comparing the original data to the protected data to ascertain quality, usefulness, and the ability to make unbiased inferences. For example, one can run “before and after” analyses for:
- Univariate statistics (means, skewness)
- Bivariate statistics (correlation, association)
- Multivariate statistics (model parameters)
The univariate and bivariate statistics are an obvious thing to check, but the use of multivariate tests needs to increase.
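A “before and after” check of this kind can be sketched on simulated data. The noise-based protection applied here is an illustrative stand-in for whatever method an agency actually used.

```python
import numpy as np

rng = np.random.default_rng(3)
original = rng.normal(100, 15, size=10000)
protected = original + rng.normal(0, 3, size=original.size)  # stand-in protection

def skewness(x):
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

# Univariate: means and skewness should survive the protection.
mean_diff = abs(protected.mean() - original.mean())
skew_diff = abs(skewness(protected) - skewness(original))

# Bivariate: correlation with a companion variable should survive too.
companion = 0.5 * original + rng.normal(0, 10, size=original.size)
r_before = np.corrcoef(original, companion)[0, 1]
r_after = np.corrcoef(protected, companion)[0, 1]
```

A multivariate version would refit the same model (for example, a regression of interest) on both files and compare parameter estimates.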
Finally, there are a number of limitations of Statistical Policy Working Paper 22:
- Breaking the link between the person and identification in a file is not enough in the digital age environment.
- Anonymity cannot be sustained in the broader data environment even if it is preserved within a single data file.
- The concern is shifting from re-identification to “reachability”; the role of predictive analytics keeps increasing.
- Risk assessment needs to change to include classes of persons. We need to include the kind and intensity of harm that people are exposed to by governments basing their decisions on algorithms that identify correlations in files.
False positives are a new kind of harm. If you modify the data, make sure that people are not re-identified inaccurately.
There are “3 C’s”—coincidence, causation, and correlation. Correlation is dominating these days. This is a new age in data confidentiality: we need to go beyond merely preventing re-identification.
To test the effectiveness of the disclosure protection by matching the public use file back to the original data, the matching is typically done with just a subset of variables—those that might be available to a potential intruder. This involves some judgment about what variables are “out there.” It would be too easy to re-identify records with the original data if all fields were used, but it would not provide a realistic measure of risk.
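A toy version of this matching test follows. The records, the choice of quasi-identifiers assumed to be “out there” (age group, ZIP, sex), and the coarsening step are all invented for illustration.

```python
from collections import Counter

# Invented original records with identities attached.
original = [
    {"name": "A", "age": 34, "zip": "20500", "sex": "F"},
    {"name": "B", "age": 35, "zip": "20500", "sex": "F"},
    {"name": "C", "age": 58, "zip": "10001", "sex": "M"},
]

def age_group(age):
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

# Released file: age is coarsened into ten-year groups as protection.
released = [
    {"age_group": age_group(r["age"]), "zip": r["zip"], "sex": r["sex"]}
    for r in original
]

# An intruder matching on age group, ZIP, and sex can single out a record
# only when its combination of values is unique in the released file.
keys = Counter((r["age_group"], r["zip"], r["sex"]) for r in released)
reidentified = sum(
    1 for r in released if keys[(r["age_group"], r["zip"], r["sex"])] == 1
)
```

Here the coarsening protects the two records that share a key, but the third record remains unique on the assumed quasi-identifiers and would count as a re-identification risk.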
Sensitive variables may not be highly correlated. Perturbing them independently may not be very effective in preventing re-identification. There are procedures for multivariate masking. Unfortunately, one may have to do a lot of damage to protect the data.
When introducing additive noise, it may be useful to assign the noise with a “donut” distribution. The Census Bureau explored the use of additive noise in the context of tabular data—work described by Laura (then Zayatz) McKenna. First you determine at random whether to make a positive or a negative adjustment to a value and then assign random noise with a nonzero mean and specified variance. This assures that most records are altered at least a small amount, but the mean noise is still zero.
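The “donut” scheme described above can be sketched as follows; the location and scale of the magnitude distribution are illustrative parameters, not the Census Bureau’s actual settings.

```python
import numpy as np

rng = np.random.default_rng(11)
values = rng.normal(1000, 200, size=10000)

# Pick a sign at random, then draw a magnitude from a distribution with a
# nonzero mean, so nearly every record moves by at least a small amount
# while the noise remains mean zero overall.
sign = rng.choice([-1.0, 1.0], size=values.size)
magnitude = np.abs(rng.normal(loc=20.0, scale=5.0, size=values.size))
noise = sign * magnitude
perturbed = values + noise
```

The “donut” name reflects the hole near zero: adjustments close to zero are rare, unlike ordinary zero-centered Gaussian noise, which leaves many records essentially unchanged.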
How much data swapping do you need to do? One view is that much of the effectiveness of swapping comes from introducing uncertainty about whether a value was swapped or not. It may not be necessary to swap a very high percentage of values for swapping to be effective.
A strategy in releasing synthetic data is to give users the opportunity to have their final estimates run on the real data. This provides a way of validating the users’ findings. This is not always viable, however. A number of years ago the Statistics of Income division of the Internal Revenue Service (IRS) set up a data users group. An important application of public use tax data is in microsimulation models. The thinking was that the IRS could test the impact of different masking strategies on the quality of the data by bringing one of these models into the IRS and running alternative public use datasets through the model in order to compare results. The users explained that a dataset requires months of work before it can be used in a model. It wasn’t feasible to test alternative datasets by simply running them through the model.
Agencies need to be concerned about not just current threats but future threats. Once you release a dataset, it’s out there. You cannot take it back.
Joan Turek: You don’t want to manipulate the data so much that you take away its usefulness to the user. Synthetic data have not provided acceptable estimates of the characteristics of small groups who are critical to policy making—for example, unwed mothers, SSI recipients, and the disabled. Good estimates of the characteristics of small groups, detailed enough to permit cross-tabulations, are needed. Groups such as these account for large amounts of federal expenditures and need to be accurately portrayed. People using data often have turnaround deadlines that are too short to allow them to wait for runs on the fully accurate data.
Scanlon: We found that the more we use techniques that modify data, the more disturbed the researchers become, and they lose confidence in the data. There is an anecdote of trying to produce estimates for a regulation. We got lots of pushback from the public because they didn’t understand the statistical adjustments we had made.
Bournazian: With non-disturbance methods, the danger of re-identifying certain persons, or of producing false positives through predictive analytics, is high. We have an additional responsibility to protect people from these erroneous re-identifications.
Turek: We don’t publish individual data, but there is a need for the data to be valuable/useful. It’s a two-sided coin.
Steve Cohen (AHRQ): We can use research data centers to balance the two principles. One limitation is that geographical size reduces the ability to protect individuals. Some say there is “no zero risk of disclosure.” We hear that, but we try to minimize risk. We should be more concerned about false positive identification by faulty re-identification. We are legally obligated to protect against true positive identification but technically not responsible for false positive identification. What liability do federal agencies have?
Bournazian: The main limitation of Statistical Policy Working Paper 22 in the section on risk assessment is its focus entirely on re-identification. We can’t be blind to the era we’re living in. And regarding Open Data, there is a software package that takes audio information and converts it into digital form. It not only digitizes the data; it converts the digitized data into a computational format, which is more powerful. This is happening in multiple subject matters. Someone can be included in a loan application file, and predictive analytics may say that this person is a high risk, resulting in a loan denial even though the consumer had paid on time to date. Someone’s kid can be identified as not making it into an IB program because predictive analytics says he is not a performer. We can see a lot of these events in health data. Someone could be suffering from a medical condition—say a bone fracture—have to stay in the hospital and be prescribed medicine, and an insurance company makes its coverage decision based on the file.
Brian Harris-Kojetin (OMB): See Hermann Habermann’s article on distinguishing confidentiality versus group harm. For example, survey responses may result in effects (such as where to site a hospital); we cannot guarantee that there will be no such harm. That’s a totally different issue than protecting confidentiality.
John Eltinge (BLS): Take the RDC as a gold standard. There is a tradeoff between disclosure risk and data quality. What about the cost/burden on the analyst? Has anyone studied the incremental cost/burden of going to an RDC?
Scanlon: There are burdens involved with going to an RDC and complying with data use agreement stipulations. Since risk rises with public use data, we need to increase restrictions on the data content. There is a spectrum of availability.
Margo Schwab (OMB): The data quality concern here is one of marginally lower quality: minor perturbations may have very little effect on certain analyses. “Data quality” usually refers to major problems rather than to precision in very specific analyses. CMS is experimenting with virtual data centers—this is another part of the spectrum. I hope today’s meeting helps us figure out the right places on the spectrum in matching datasets to uses. We should push boundaries farther in coming up with creative ways to make data both available and protected.