Preventing re-identification is the primary focus of de-identification and of the statistical disclosure limitation methods discussed in this section. A secondary objective (or at least a byproduct) is to limit what an intruder might learn from an apparent re-identification. Techniques that alter data values or exchange values between respondents contribute to both goals, and when users are informed that these techniques have been applied, their use may also discourage potential intruders from attempting to re-identify records.
a. Overview of Methods
Statistical disclosure avoidance techniques for microdata have been well developed and widely published in journals, textbooks, and workshop and conference proceedings. The following two sources provide comprehensive accounts of these techniques: (1) Statistical Policy Working Paper 22 (FCSM 2005); and (2) the Handbook on Statistical Disclosure Control (Hundepool et al. 2010). Techniques to protect microdata for public release include approaches that pertain to the file as a whole and approaches that apply to individual variables within the file. For example, Statistical Policy Working Paper 22 identified the following approaches:
- Include data from only a sample of the population
- Do not include obvious identifiers
- Limit geographic detail
- Limit the number and detailed breakdown of categories within variables on the file
- Truncate extreme codes for certain variables (top or bottom coding)
- Recode into intervals or round continuous variables
- Add or multiply by random numbers (adding noise)
- Swap or rank swap the values on otherwise similar records (also called switching)
- Select records at random and blank out selected variables and impute the missing values (also called blank and impute)
- Aggregate across small groups of respondents and replace each individual’s reported value with the average
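As an illustration of the blank-and-impute approach in the list above, the following sketch blanks a variable on randomly selected records and then fills the blanks from donor records. The records and the hot-deck imputation rule are hypothetical, chosen only to make the steps concrete:

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical records; "occupation" is the variable to blank and impute.
records = [
    {"id": 1, "occupation": "teacher"},
    {"id": 2, "occupation": "nurse"},
    {"id": 3, "occupation": "engineer"},
    {"id": 4, "occupation": "teacher"},
]

# Blank: select records at random and delete the value.
for r in random.sample(records, k=2):
    r["occupation"] = None

# Impute: fill each blank with a value drawn from the remaining donors
# (a crude hot-deck imputation, for illustration only).
donors = [r["occupation"] for r in records if r["occupation"] is not None]
for r in records:
    if r["occupation"] is None:
        r["occupation"] = random.choice(donors)

print(records)
```

In a production system the imputation step would use the agency's standard imputation model rather than a simple random donor draw.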
It is essential for an agency releasing a public use microdata file to remove all obvious identifiers from the individual records. However, some respondents have characteristics or combinations of characteristics that make them stand out from others. As demonstrated in the previous chapter, de-identification alone is not sufficient to eliminate the risk of disclosure.
The methods described next can be used to lower disclosure risk from released microdata. Some of these methods are suitable only for categorical variables, or only for continuous variables, whereas others can be applied to both types of variables.
Nonperturbative methods. These methods do not alter data values; rather, they implement partial suppressions or reductions of detail in the original dataset. These techniques include the following:
- Sampling: releasing a subsample of the original microdata
- Global recoding: combining several categories to form new, less specific categories
- Top and bottom coding: combining values in the upper (or lower) tail of a distribution (a special case of global recoding)
- Local suppression: suppressing the values of individual variables for selected records so that no information about these variables is conveyed for these records
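Global recoding and top coding can be sketched in a few lines. The data below are hypothetical, and the age bands and income cutoff are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical microdata: ages and incomes for five respondents.
df = pd.DataFrame({
    "age": [23, 34, 45, 67, 91],
    "income": [18_000, 52_000, 75_000, 240_000, 1_300_000],
})

# Global recoding: collapse exact ages into broad, less specific bands.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 70, 120],
                        labels=["<30", "30-49", "50-69", "70+"])

# Top coding: truncate incomes above a cutoff to the cutoff itself,
# so extreme earners no longer stand out.
TOP_CODE = 200_000
df["income_tc"] = df["income"].clip(upper=TOP_CODE)

print(df[["age_band", "income_tc"]])
```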
Sampling introduces or increases the uncertainty that a particular individual is included in a microdata file, and in doing so it provides a strong disincentive for a would-be intruder to attempt to re-identify records on the file. Sampling can produce a very strong disincentive if the intruder has access to identified records for only a small subset of the population, as there may be no overlap between the two files—that is, no records included in both files. On the other hand, sampling will provide less of a disincentive for attempted re-identification if the intruder has data on the entire population, as the intruder can be nearly certain that every record in the public use microdata file is represented in the population data.
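The disincentive that sampling creates can be quantified with a simple probability calculation. Under the simplifying assumption that each record is included independently at the release rate, the chance that none of an intruder's externally identified targets appears in the file is:

```python
# Assumed release rate and number of externally identified targets;
# both figures are hypothetical.
sampling_fraction = 0.05
n_targets = 10

# Probability that none of the intruder's targets is in the released
# file (independent inclusion is a simplifying assumption).
p_no_overlap = (1 - sampling_fraction) ** n_targets
print(f"P(no target released) = {p_no_overlap:.3f}")
```

Even with ten targets, there is roughly an even chance that the release contains none of them, which undercuts any claimed match.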
Perturbative methods. With these methods, values in the microdata are distorted, but this is done in such a way that key statistical properties or relationships in the original data are preserved. These techniques include the following:
- Noise addition: perturbing the original values by adding random numbers to them or multiplying them by random numbers
- Data swapping: selecting a sample of records, finding a match in the database on a set of predetermined variables, and swapping all other variables
- Rank swapping: unlike regular swapping, in which pairs must match exactly on the predetermined variables, rank swapping pairs records that are close to each other in a list sorted by a continuous variable; frequently the variable used in the sort is the one that is swapped
- Shuffling: like shuffling a deck of cards, the values of a confidential variable are reordered in a way that preserves the correlation between the confidential variable and a non-confidential variable while also preserving the correlation between the rank order of the confidential variable and that of a non-confidential variable in the original data
- Rounding: replace the original values of variables with rounded values
- Resampling: for a variable in the original data, a new variable for released data is created in which the values of this new variable are calculated as the average of a set of resampled values from the original variable
- Blurring: replacing a reported value (or values) by the aggregate values (for example, the mean) across small sets of respondents for selected variables
- Microaggregation: a form of data blurring in which records are grouped based on a proximity measure of all variables of interest, and the same groups of records are used in calculating aggregates for those variables; Domingo-Ferrer and Mateo-Sanz (2002) note that microaggregation provides a way to achieve k-anonymity with respect to one or more quantitative attributes
- Post-randomization method or PRAM (Gouweleeuw et al. 1997): a probabilistic, perturbative method for a categorical variable; in the masked file, the scores on some categorical variables for certain records in the original file are changed to a different score according to a prescribed probability mechanism, namely a Markov matrix
- Micro agglomeration, substitution, subsampling and calibration, or MASSC (Singh et al. 2004): this method forms sets of identifying variables (called strata) to find records that might be at risk of disclosure (that is, unique records) and calculates a disclosure risk measure for each stratum, with unique records assigned the disclosure risk associated with their stratum; an overall measure of disclosure risk can be calculated for an individual record, and for an entire database, by collapsing over the strata
- Synthetic microdata (Rubin 1993): Some or all of the variables in the original dataset are replaced with imputed (synthetic) variables developed from models based on the original data; while certain statistics or internal relationships in the original dataset are preserved, the synthetic variables do not represent actual individuals
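Of the perturbative methods above, microaggregation lends itself to a compact sketch. The univariate version sorts records on a variable, groups them in sets of at least k, and replaces each value with its group mean, so every released value is shared by at least k records (the link to k-anonymity noted by Domingo-Ferrer and Mateo-Sanz). The data and group size below are hypothetical:

```python
import numpy as np

def microaggregate(values, k=3):
    """Univariate microaggregation: sort, form groups of at least k
    consecutive records, and replace each value with its group mean."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # If the last group is smaller than k, fold it into the previous one.
    if len(groups) > 1 and len(groups[-1]) < k:
        last = groups.pop()
        groups[-1] = np.concatenate([groups[-1], last])
    out = np.empty_like(values)
    for g in groups:
        out[g] = values[g].mean()
    return out

incomes = [18, 22, 25, 40, 41, 300]  # thousands; 300 is an outlier
masked = microaggregate(incomes, k=3)
print(masked)
```

Note that the outlier no longer stands out: it is replaced by the mean of its group, and each released value appears at least k times.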
Even when the microdata have been protected using one or more of these statistical disclosure limitation techniques and are perceived to be safe for release, the risk of re-identification, in all likelihood, is still not zero.10 The level of risk depends on the amount of information or knowledge available to the intruder and how adept the intruder is at matching this information to the microdata in question.
b. Recent Advances in Protecting Microdata
Much of the recent research on protecting microdata has focused on how the usefulness of the data is affected when methods of statistical disclosure limitation are applied. This topic is addressed in the next chapter. Research on improving the protection afforded to public use microdata has largely sought to enhance existing approaches rather than to develop entirely new ones.
Singh (2009) proposes an enhanced version of MASSC that generalizes the risk measures used in altering the data to encompass cases with “partial risk,” defined as having risk scores between 0 and 1. All records with nonzero risk are subject to treatment (that is, alteration of data values), but only a random subset is actually treated. Both disclosure risk and information loss are assessed in developing the final dataset.
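The selection step in this enhanced approach — all records with nonzero risk are candidates, but only a random subset is treated — can be sketched as follows. The risk scores are hypothetical, and the rule of treating each candidate with probability equal to its risk score is an assumed illustration, not the published MASSC selection rule:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical risk scores in [0, 1] for ten records; 1.0 marks a
# unique (fully at-risk) record, intermediate values "partial risk".
risk = [0.0, 0.0, 0.2, 0.35, 0.0, 0.6, 1.0, 0.1, 0.0, 0.8]

# Every record with nonzero risk is a candidate for treatment, but
# only a random subset is actually treated (here, each candidate with
# probability equal to its risk score -- an assumed rule).
treated = [i for i, r in enumerate(risk) if r > 0 and random.random() < r]
print("records selected for treatment:", treated)
```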
Machanavajjhala et al. (2005) show limitations of k-anonymity in two situations: (1) one in which the k individuals are homogeneous with respect to particular characteristics, resulting in attribute disclosure; and (2) one in which the intruder possesses background knowledge that makes it possible to differentiate between the target individual and the k-1 other individuals. To overcome these limitations, the authors propose the concept of l-diversity, which requires that the values of sensitive attributes be well-represented in each group. Further work will focus on extending the concept of l-diversity to multiple sensitive attributes and to continuous sensitive attributes.
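The homogeneity problem can be made concrete with a small check of distinct l-diversity: count the distinct sensitive values within each quasi-identifier group. The records and group labels below are hypothetical:

```python
from collections import defaultdict

# Toy released records: (quasi-identifier group, sensitive value).
# The group label stands in for a combination such as age band plus
# ZIP prefix; "condition" is the sensitive attribute.
records = [
    ("30-39/021*", "flu"), ("30-39/021*", "asthma"), ("30-39/021*", "flu"),
    ("40-49/021*", "cancer"), ("40-49/021*", "cancer"), ("40-49/021*", "cancer"),
]

def distinct_l_diversity(records):
    """Return, per quasi-identifier group, the number of distinct
    sensitive values (distinct l-diversity)."""
    groups = defaultdict(set)
    for qi, sensitive in records:
        groups[qi].add(sensitive)
    return {qi: len(vals) for qi, vals in groups.items()}

diversity = distinct_l_diversity(records)
print(diversity)
```

Here the second group is 3-anonymous but only 1-diverse: every record in it shares the sensitive value, so knowing that a target falls in the group discloses the value even without re-identifying a specific record.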
Efforts to improve the quality of synthetic data have received attention as well. Zayatz (2008) notes that this is one of three areas of current research on disclosure avoidance at the Census Bureau (the other two being the use of noise addition for tabular magnitude data and the development of a system for remote microdata analysis). The Census Bureau uses the synthetic method to produce two databases that incorporate data from administrative records and is also applying synthetic methods to produce group quarters microdata from the American Community Survey. One of the challenges in generating synthetic data from real data, Zayatz notes, involves dealing with structurally missing values—for example, children ever born to males. The level of complexity required to incorporate structural zeroes into the modeling exceeded the capacity of the methods used at the time. Instead, values were imputed to the structural zeroes and later removed. Enhancements to the methodology include a multi-level modeling of parent-child relationships that incorporates all of the constraints.
10 Theoretically, a fully synthetic file has no risk of disclosure because none of the records corresponds to an actual person. Concerns about synthetic data focus almost exclusively on their usefulness for analysis, a concept discussed below. Nevertheless, if a synthetic file mimics the original data sufficiently closely, it can still reveal information about the individuals in the original data. In other words, if a synthetic file captures relationships in the original data so well that it is highly useful analytically, it may also carry some disclosure risk.