Statistical disclosure avoidance techniques for microdata are well developed and have been widely published in journals, textbooks, and workshop and conference proceedings. Two sources provide comprehensive accounts of these techniques: (1) Statistical Policy Working Paper 22 (FCSM 2005); and (2) Handbook on Statistical Disclosure Control (Hundepool et al. 2010). Techniques for protecting public-release microdata include approaches that apply to the file as a whole and approaches that apply to individual variables within the file. For example, Statistical Policy Working Paper 22 identified the following approaches:
- Include data from only a sample of the population
- Do not include obvious identifiers
- Limit geographic detail
- Limit the number and detailed breakdown of categories within variables on the file
- Truncate extreme codes for certain variables (top or bottom coding)
- Recode into intervals or round continuous variables
- Add or multiply by random numbers (adding noise)
- Swap or rank swap the values on otherwise similar records (also called switching)
- Select records at random and blank out selected variables and impute the missing values (also called blank and impute)
- Aggregate across small groups of respondents and replace each individual’s reported value with the average
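Two of the approaches listed above, adding noise and rounding, can be illustrated with a minimal sketch. The function names, the noise distributions (Gaussian for additive noise, uniform for multiplicative noise), and the income values are assumptions made for illustration only; they are not taken from Working Paper 22.

```python
import random

def add_noise(values, sd, rng):
    """Additive noise: add an independent zero-mean Gaussian draw to each value."""
    return [v + rng.gauss(0.0, sd) for v in values]

def multiply_noise(values, spread, rng):
    """Multiplicative noise: scale each value by a factor drawn
    uniformly from [1 - spread, 1 + spread]."""
    return [v * rng.uniform(1.0 - spread, 1.0 + spread) for v in values]

def round_to_base(values, base):
    """Rounding: replace each value with the nearest multiple of `base`."""
    return [base * round(v / base) for v in values]

rng = random.Random(0)  # fixed seed so the illustration is repeatable
incomes = [21_400, 34_250, 48_900, 52_300, 125_000]  # hypothetical values

noisy_add = add_noise(incomes, sd=1_000, rng=rng)
noisy_mult = multiply_noise(incomes, spread=0.05, rng=rng)
rounded = round_to_base(incomes, base=1_000)
```

In this sketch the multiplicative version keeps each perturbed value within 5 percent of the original, so the absolute distortion grows with the size of the value, whereas the additive version applies the same noise scale to every record.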
A more complete list of statistical disclosure limitation methods is presented below, divided between nonperturbative and perturbative methods. Some of these techniques are suitable only for categorical variables, or only for continuous variables, whereas others can be applied to both types of variables.
Nonperturbative methods. These methods do not alter data values; rather, they implement partial suppressions or reductions of detail in the original dataset. These techniques include the following:
- Sampling: releasing a subsample of the original microdata
- Global recoding: combining several categories to form new, less specific categories
- Top and bottom coding: combining values in the upper (or lower) tail of a distribution (a special case of global recoding)
- Local suppression: suppressing the values of individual variables for selected records so that no information about these variables is conveyed for these records
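The nonperturbative techniques above can be sketched as follows. The age bands, the income cap, and the rule used to select records for suppression (blanking a variable when a record's combination of key variables occurs fewer than k times) are hypothetical choices made for illustration; in practice an agency would set them from its own risk assessment.

```python
from collections import Counter

def global_recode_age(age):
    """Global recoding: collapse single years of age into broad bands
    (the cut points here are hypothetical)."""
    if age < 18:
        return "0-17"
    if age < 65:
        return "18-64"
    return "65+"

def top_code(value, cap):
    """Top coding: report every value above `cap` as `cap`."""
    return min(value, cap)

def local_suppress(records, keys, target, k):
    """Local suppression: blank `target` on records whose combination of
    `keys` appears fewer than k times in the file (an assumed selection rule)."""
    counts = Counter(tuple(r[key] for key in keys) for r in records)
    return [
        {**r, target: None} if counts[tuple(r[key] for key in keys)] < k else dict(r)
        for r in records
    ]

original = [  # hypothetical records
    {"age": 34, "sex": "F", "income": 52_000},
    {"age": 41, "sex": "F", "income": 61_000},
    {"age": 29, "sex": "M", "income": 48_000},
    {"age": 71, "sex": "M", "income": 310_000},
]

recoded = [
    {
        "age_band": global_recode_age(r["age"]),
        "sex": r["sex"],
        "income": top_code(r["income"], cap=150_000),
    }
    for r in original
]
released = local_suppress(recoded, keys=("age_band", "sex"), target="age_band", k=2)
```

Here the two female records share the same key combination and are released with their age band intact, while the two male records are each unique on age band and sex, so their age band is suppressed.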
Sampling introduces or increases the uncertainty that a particular individual is included in a microdata file, and in doing so it provides a strong disincentive for a would-be intruder to attempt to re-identify records on the file. Sampling can produce a very strong disincentive if the intruder has access to identified records for only a small subset of the population, as there may be no overlap between the two files—that is, no records included in both files. On the other hand, sampling will provide less of a disincentive for attempted re-identification if the intruder has data on the entire population, as the intruder can be nearly certain that every record in the public use microdata file is represented in the population data.
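The effect of sampling on the overlap between the two files can be illustrated with a small simulation. This is a sketch under simplifying assumptions: both files are drawn as simple random samples from the same population, independently of each other, and the population size and sampling fractions are hypothetical.

```python
import random

def overlap_share(population_size, release_fraction, intruder_fraction, rng):
    """Return the share of released records that also appear in the
    intruder's identified file, assuming two independent simple random samples."""
    population = range(population_size)
    released = set(rng.sample(population, round(population_size * release_fraction)))
    intruder = set(rng.sample(population, round(population_size * intruder_fraction)))
    return len(released & intruder) / len(released)

rng = random.Random(1)  # fixed seed so the illustration is repeatable

# Intruder holds identified records for only 2 percent of the population.
small_file = overlap_share(10_000, release_fraction=0.05, intruder_fraction=0.02, rng=rng)

# Intruder holds identified records for the entire population.
full_file = overlap_share(10_000, release_fraction=0.05, intruder_fraction=1.0, rng=rng)
```

Under these assumptions the expected overlap share equals the intruder's coverage of the population, so it is small in the first case and exactly 1 in the second, where every released record is present in the intruder's file.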
Perturbative methods. With these methods, values in the microdata are distorted, but this is done in such a way that key statistical properties or relationships in the original data are preserved. These techniques include the following:
- Noise addition: adding random numbers to the original values or multiplying the original values by random numbers
- Data swapping: selecting a sample of records, finding a match in the database on a set of predetermined variables, and swapping all other variables
- Rank swapping: whereas regular swapping pairs records that match exactly, rank swapping pairs records that are close to each other on a list sorted by a continuous variable; frequently the variable used in the sort is the one that will be swapped
- Shuffling: like shuffling a deck of cards, the values of a confidential variable are reordered in a way that preserves the correlation between the confidential variable and a non-confidential variable while also preserving the correlation between the rank order of the confidential variable and that of a non-confidential variable in the original data
- Rounding: replace the original values of variables with rounded values
- Resampling: for a variable in the original data, a new variable for released data is created in which the values of this new variable are calculated as the average of a set of resampled values from the original variable
- Blurring: replacing a reported value (or values) by the aggregate values (for example, the mean) across small sets of respondents for selected variables
- Microaggregation: a form of data blurring in which records are grouped based on a proximity measure of all variables of interest, and the same groups of records are used in calculating aggregates for those variables; Domingo-Ferrer and Mateo-Sanz (2002) note that microaggregation provides a way to achieve k-anonymity with respect to one or more quantitative attributes
- Post-randomization method or PRAM (Gouweleeuw et al. 1997): a probabilistic, perturbative method for a categorical variable; in the masked file, the scores on some categorical variables for certain records in the original file are changed to a different score according to a prescribed probability mechanism, namely a Markov matrix
- Micro agglomeration, substitution, subsampling and calibration, or MASSC (Singh et al. 2004): this creates sets of identifying variables (called strata) to find records that might be at risk of disclosure (that is, unique records) and calculates a disclosure risk measure for each stratum (unique records are also assigned a disclosure risk associated with that stratum); an overall measure of disclosure risk can be calculated for an entire database by collapsing over the strata
- Synthetic microdata (Rubin 1993): some or all of the variables in the original dataset are replaced with imputed (synthetic) variables developed from models based on the original data; while certain statistics or internal relationships in the original dataset are preserved, the synthetic variables do not characterize actual individuals
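Three of the perturbative techniques listed above, rank swapping, microaggregation, and PRAM, can be sketched in simplified form. These are illustrations under stated assumptions rather than the published algorithms: the rank swap pairs each record with a random partner at most `window` ranks away, the microaggregation is univariate with fixed-size groups (the multivariate form described above groups records on a proximity measure across several variables), and the PRAM transition probabilities are hypothetical.

```python
import random

def rank_swap(values, window, rng):
    """Swap each value with that of a randomly chosen, not-yet-swapped
    record lying at most `window` ranks above it in sorted order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = list(values)
    swapped = set()
    for pos, i in enumerate(order):
        if i in swapped:
            continue
        candidates = [j for j in order[pos + 1 : pos + 1 + window] if j not in swapped]
        if not candidates:
            continue  # no eligible partner nearby; leave the value unchanged
        j = rng.choice(candidates)
        out[i], out[j] = values[j], values[i]
        swapped.update((i, j))
    return out

def microaggregate(values, k):
    """Sort the values, form consecutive groups of k (the last group absorbs
    any remainder), and replace each value with its group mean."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    n_groups = max(1, len(values) // k)
    for g in range(n_groups):
        start = g * k
        end = len(values) if g == n_groups - 1 else start + k
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
    return out

def pram(categories, transition, rng):
    """Replace each category with a draw from its row of the Markov
    transition matrix, given as {original: {released: probability}}."""
    out = []
    for c in categories:
        targets = list(transition[c])
        weights = [transition[c][t] for t in targets]
        out.append(rng.choices(targets, weights=weights, k=1)[0])
    return out

rng = random.Random(0)  # fixed seed so the illustration is repeatable
incomes = [10, 12, 11, 50, 52, 51, 90]  # hypothetical values

swapped_incomes = rank_swap(incomes, window=2, rng=rng)
averaged_incomes = microaggregate(incomes, k=3)

transition = {  # hypothetical probabilities; each row sums to 1
    "employed": {"employed": 0.9, "unemployed": 0.1},
    "unemployed": {"employed": 0.2, "unemployed": 0.8},
}
status = ["employed", "employed", "unemployed", "employed"]
masked_status = pram(status, transition, rng)
```

In this sketch the rank swap leaves the set of released values, and therefore the univariate distribution, unchanged, and the microaggregation preserves the overall total while ensuring that every released value is shared by at least k records.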
We note that because of swapping and other techniques, the Census Bureau has been willing to publish tabulations from the decennial long form that, for small geographic areas, included frequency counts as low as 1, even for sensitive variables such as income class.
Even when the microdata have been protected using one or more of these statistical disclosure limitation methods and are perceived to be safe for release, the risk of re-identification, in all likelihood, is still not zero.[13] Chapter IV discusses re-identification risk and its potential sources.
[13] Theoretically, a fully synthetic file has no risk of disclosure because none of the records corresponds to an actual person. Concerns about synthetic data focus almost exclusively on their usefulness for analysis, a concept discussed in Chapter IV. Nevertheless, if a synthetic file mimics the original data sufficiently closely, it can still reveal information about the individuals in the original data. In other words, if a synthetic file captures relationships in the original data so well that it is highly useful analytically, it may also carry some disclosure risk.