If a dataset that is protected against disclosure can be described as one in which a user cannot determine anything about a given individual from the dataset that could not be determined without the dataset, then by the same principle, preserving utility could be described in the following terms: the data should be no less useful to the users than if statistical disclosure limitation had not been applied. Arguably, no dataset to which any measure of disclosure protection has been applied can meet this standard. Preserving the utility of the data is a prominent topic in the statistical literature but much less so in the public discussion of data security. In the statistical literature, research has focused on measuring the information loss due to the application of protective measures, but more recent research is exploring the joint problem of minimizing disclosure risk while maximizing utility. This is illustrated by Shlomo (2010), who argues for a coordinated analysis of disclosure risk and information loss by data producers in order to maximize the analytic utility of the public use data consistent with providing the desired level of protection.
Purdam and Elliot (2007) classify the impact of statistical disclosure limitation on data utility into two categories: (1) reduction of analytical completeness and (2) loss of analytical validity. The former implies that some analyses cannot be conducted because critical information has been removed from the file. The latter implies that some analyses will yield different conclusions than if they had been conducted on the original data.
Procedures applied to protect the data may seriously reduce their usefulness or, worse, lead to incorrect inferences about a population or subpopulation. A recent example underscores this concern. The Census Bureau makes extensive use of data swapping, in which the values of selected variables are exchanged between respondents. For several years, errors in the program used to swap data in the Current Population Survey Annual Social and Economic Supplement and the American Community Survey produced a significant distortion in the age-sex composition of the elderly population, which was eventually noticed by users (Alexander et al. 2010).
As a general principle, statistics computed from the protected dataset should not differ significantly from the statistics obtained from the original dataset. One approach to measuring information loss is to compare statistics (totals, means, medians, and covariances) between the public use data and the source data. Some of the methods of statistical disclosure limitation discussed in the previous chapter have been shown to protect certain statistics, in particular totals, or to introduce less distortion into covariances than other methods. Consider, for example, top coding. One can assign top codes in such a way that the original totals are preserved (by assigning the mean of the top-coded values as the top code), but this preservation does not extend to variances, which will be reduced.
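The top-coding example can be checked with a small sketch. The function name `top_code_with_mean`, the income values, and the threshold are all hypothetical and chosen only for illustration; the sketch assumes that every value above the threshold is replaced by the mean of those values.

```python
import statistics

def top_code_with_mean(values, threshold):
    """Replace each value above `threshold` with the mean of those values."""
    tail = [v for v in values if v > threshold]
    if not tail:
        return list(values)  # nothing exceeds the threshold
    top_code = statistics.fmean(tail)
    return [top_code if v > threshold else v for v in values]

# Hypothetical incomes, not drawn from any real survey.
incomes = [20_000, 35_000, 50_000, 80_000, 150_000, 400_000, 1_200_000]
protected = top_code_with_mean(incomes, threshold=100_000)

total_before, total_after = sum(incomes), sum(protected)
var_before = statistics.pvariance(incomes)
var_after = statistics.pvariance(protected)
print(total_before, round(total_after))  # totals agree up to rounding
print(var_after < var_before)            # variance shrinks
```

The total is preserved because the replaced values keep their combined sum, while the variance falls because the spread among the largest values is removed.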
Shlomo (2010) reviews several approaches to measuring information loss. These include distance metrics, impacts on measures of association, and impacts on regression analyses. Because statistical disclosure limitation introduces error into the data, measures of goodness of fit may capture information loss particularly well. Depending on how disclosure limitation affects the data, the error added may reduce between-group variance and increase within-group variance in regression analysis or ANOVA. Alternatively, disclosure limitation may artificially increase between-group variance, creating more association than was present in the original data. Calculating a range of information loss measures will enhance the data producer’s understanding of the impact of disclosure limitation on the analytic utility of the data.
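As a rough illustration of two such measures, the sketch below computes a simple distance metric and the between-group share of the total sum of squares before and after masking. It is not taken from Shlomo (2010): the function names, the simulated two-group data, and the choice of additive random noise as the masking method are all assumptions made for the example, and it shows only the first case described above, in which noise inflates within-group variance.

```python
import random

def mean_abs_distance(original, masked):
    """Average absolute difference between paired original and masked values."""
    return sum(abs(o - m) for o, m in zip(original, masked)) / len(original)

def between_group_share(values, groups):
    """Share of the total sum of squares explained by the group means."""
    grand = sum(values) / len(values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand) ** 2 for vs in by_group.values()
    )
    ss_total = sum((v - grand) ** 2 for v in values)
    return ss_between / ss_total

# Simulated data: two groups whose means differ by three units (hypothetical).
rng = random.Random(0)
groups = ["a"] * 200 + ["b"] * 200
original = [rng.gauss(10, 1) for _ in range(200)] + [
    rng.gauss(13, 1) for _ in range(200)
]
# Assumed masking method: independent additive noise with standard deviation 3.
masked = [v + rng.gauss(0, 3) for v in original]

distance = mean_abs_distance(original, masked)
share_original = between_group_share(original, groups)
share_masked = between_group_share(masked, groups)
print(distance, share_original, share_masked)
```

Under these assumptions the group effect accounts for a much smaller share of the variation in the masked data, so an analysis of variance on the public file would understate the association between group and outcome.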
Measures of data utility provide a basis for assessing and comparing alternative methods of statistical disclosure limitation as well. Woo et al. (2009) present several global (as opposed to analysis-specific) measures of data utility for masked data. They compare measures based on empirical distribution estimation, cluster analysis, and propensity scores, and they find that the measures based on propensity scores appear to hold the most promise for general use. In their analysis, the measure based on propensity scores exhibits the expected behavior with increasing levels of alteration of the original data, and it differentiates among alternative masking strategies more effectively than the measures based on the other two approaches.
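A propensity-score measure of this kind might be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes the measure is computed by pooling the original and masked records, estimating each record's probability of belonging to the masked file, and averaging the squared deviation of those probabilities from the masked share of the pooled file. The single variable, the quadratic term in the model, the gradient-descent fit, and every function name are assumptions.

```python
import math
import random

def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]

def predict(weights, row):
    """Logistic probability for one row; weights[0] is the intercept."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], row))
    z = max(-30.0, min(30.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, steps=500, rate=0.5):
    """Plain batch gradient descent; adequate for a sketch, not for production."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for row, label in zip(rows, labels):
            err = predict(weights, row) - label
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        weights = [w - rate * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def propensity_utility(original, masked):
    """Mean squared deviation of propensity scores from the masked share."""
    pooled = list(original) + list(masked)
    labels = [0] * len(original) + [1] * len(masked)
    z = standardize(pooled)
    # Linear and squared terms, so that changes in spread are detectable.
    rows = list(zip(z, standardize([v * v for v in z])))
    weights = fit_logistic(rows, labels)
    share_masked = len(masked) / len(pooled)
    return sum((predict(weights, r) - share_masked) ** 2 for r in rows) / len(rows)

# Simulated data with two hypothetical levels of additive noise.
rng = random.Random(1)
original = [rng.gauss(0, 1) for _ in range(200)]
light = [v + rng.gauss(0, 0.1) for v in original]
heavy = [v + rng.gauss(0, 2.0) for v in original]

u_identical = propensity_utility(original, list(original))
u_light = propensity_utility(original, light)
u_heavy = propensity_utility(original, heavy)
print(u_identical, u_light, u_heavy)
```

Values near zero indicate that the masked records are hard to tell apart from the originals, and larger values indicate heavier alteration. In this sketch the measure is bounded above by 0.25 because the two files are the same size.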