Minimizing Disclosure Risk in HHS Open Data Initiatives

B. Maintaining the Utility of Public Use Data


Steps taken to preserve the confidentiality of public use data have an adverse effect on the quality of the data and its general usefulness for research. Purdam and Elliot (2007) classify the impact of statistical disclosure limitation on data utility into two categories: (1) reduction of analytical completeness and (2) loss of analytical validity. The former implies that some analyses cannot be conducted because critical information has been removed from the file. The latter implies that some analyses will yield different conclusions than if they had been conducted on the original data. For example, suppressing state geography, as is done for some national databases, precludes analysis of characteristics by state. Adding noise to variables reduces the degree of fit in predictive models and lowers simple measures of association. Swapping, if not monitored carefully, can distort distributions, as was demonstrated recently with multiple Census Bureau household surveys (Alexander et al. 2010).
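To make the loss of analytical validity concrete, the sketch below works through a synthetic example in Python. The variables, sample size, and noise level are assumptions chosen purely for illustration and are not drawn from any HHS file; the point is only to show how additive noise attenuates a simple measure of association.

```python
# Illustrative sketch (synthetic data, assumed variables): adding noise to one
# variable and observing the drop in a simple measure of association.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Synthetic "original" data: income loosely predicted by years of education.
education = rng.normal(13, 3, n)
income = 2_000 * education + rng.normal(0, 10_000, n)

# Perturbed "public use" version: additive random noise applied to income.
noise_sd = 15_000
income_noisy = income + rng.normal(0, noise_sd, n)

r_original = np.corrcoef(education, income)[0, 1]
r_protected = np.corrcoef(education, income_noisy)[0, 1]

print(f"correlation, original data:  {r_original:.3f}")
print(f"correlation, protected data: {r_protected:.3f}")  # attenuated toward zero
```

Because the added noise is independent of the predictor, the correlation measured in the protected file is pulled toward zero, which is precisely the reduction in fit and association described above.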

Preserving the utility of the data is a prominent topic in the statistical literature on disclosure limitation but much less so in the public discussion of data security. In the statistical literature, research has focused on measuring the information loss that results from applying protective measures. As a general principle, statistics computed from the protected dataset should not differ significantly from the statistics obtained from the original dataset. One approach to measuring information loss is to compare statistics such as totals, means, medians, and covariances between the public use data and the source data. Some of the methods of statistical disclosure limitation discussed in the previous chapter have been shown to protect certain statistics, totals in particular, or to introduce less distortion into covariances than other methods. Consider, for example, top coding. One can assign top codes in such a way that the original totals are preserved (by using the mean of the top coded values as the top code), but this benefit does not extend to variances, which will be reduced.
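The top coding example can be sketched directly. In the illustration below, the income distribution, the threshold, and the variable names are all assumptions made for the example; every value at or above the threshold is replaced with the mean of the top coded values, so the published total matches the original while the variance shrinks.

```python
# Illustrative sketch (synthetic data, assumed threshold): top coding that
# preserves the original total but reduces the variance.
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.5, sigma=0.8, size=50_000)

threshold = np.quantile(income, 0.99)          # top code the top 1 percent
top_code = income[income >= threshold].mean()  # mean of the top-coded values

income_public = np.where(income >= threshold, top_code, income)

print(f"total,    original vs public: {income.sum():.0f} vs {income_public.sum():.0f}")
print(f"variance, original vs public: {income.var():.0f} vs {income_public.var():.0f}")
# The totals agree; the variance of the public use version is smaller.
```

The same side-by-side comparison of the source file and the public use file extends naturally to the other statistics mentioned above, such as means, medians, and covariances.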

Shlomo (2010) reviews several approaches to measuring information loss. These include distance metrics, impacts on measures of association, and impacts on regression analyses. Because statistical disclosure limitation introduces error into the data, measures of goodness of fit may capture information loss particularly well. Depending on how disclosure limitation affects the data, the error added may reduce between-group variance and increase within-group variance in regression analysis or ANOVA. Alternatively, it is possible that disclosure limitation may artificially increase between-group variance, creating more association than was present in the original data. Calculating a range of information loss measures will enhance the data producer’s understanding of the impact of disclosure limitation on the analytic utility of the data. Shlomo (2010) argues for a coordinated analysis of disclosure risk and information loss by data producers in order to maximize the analytic utility of the public use data consistent with providing the desired level of protection.
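As a rough sketch of how such measures might be computed, the example below compares the means of an original and a protected version of a variable and the goodness of fit of a simple regression run on each. The specific distance measure, the synthetic data, and the use of R-squared are assumptions made for illustration rather than Shlomo's formulations.

```python
# Illustrative sketch (synthetic data, assumed measures): two simple
# information loss measures computed on an original/protected pair.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(0, 1, n)
y = 50 + 3.0 * x + rng.normal(0, 2, n)     # original outcome
y_protected = y + rng.normal(0, 2, n)      # additive-noise "protected" version

# 1. Relative absolute distance between the protected and original means.
rad_mean = abs(y_protected.mean() - y.mean()) / abs(y.mean())

# 2. Change in goodness of fit of a simple regression of y on x.
def r_squared(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1 - residuals.var() / y.var()

print(f"relative distance between means: {rad_mean:.4f}")
print(f"R-squared, original data:  {r_squared(x, y):.3f}")
print(f"R-squared, protected data: {r_squared(x, y_protected):.3f}")  # lower fit
```

Computing several such measures side by side, as Shlomo recommends, gives the data producer a fuller picture of how a given disclosure limitation method trades protection against analytic utility.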
