Minimizing Disclosure Risk in HHS Open Data Initiatives. 4. Assessing Disclosure Risk

09/29/2014

Marsh et al. (1991) expressed a useful way to view disclosure risk: the probability of disclosure is the product of (1) the probability of a successful re-identification, conditional on someone trying to re-identify a record, and (2) the probability that someone will try to re-identify a record. A data producer can lower the risk of disclosure by reducing either of these probabilities. For example, charging a high fee for a public use file reduces the probability that a potential intruder will even acquire the file. Sampling reduces the certainty that someone of interest is included in the file, which will also discourage potential intruders. Altering the data values in various ways reduces the likelihood of a re-identification, and publicizing the fact that such measures were applied may also discourage attempts at re-identification. Altering the data also reduces the potential value of the information gained by re-identification, which may further reduce the likelihood that a would-be intruder will attempt a re-identification.
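
The decomposition attributed to Marsh et al. (1991) can be written compactly as follows. The notation is an illustrative choice rather than the report's own, and the numbers in the second line are purely hypothetical values used to show the arithmetic.

```latex
\[
  P(\text{disclosure})
    = P(\text{re-identification} \mid \text{attempt}) \times P(\text{attempt})
\]
% Hypothetical values, for illustration only:
\[
  P(\text{disclosure}) = 0.05 \times 0.02 = 0.001
\]
```

Because the two terms multiply, halving either one halves the overall probability of disclosure, which is why measures aimed at discouraging attempts and measures aimed at defeating attempts are both useful.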

When microdata protection is based on k-anonymity, the assessment of disclosure risk involves determining if k-anonymity is satisfied. Ideally, this is done with population data, but strategies applicable to sample data exist, as noted in Chapter III.
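
A minimal sketch of such a check is shown below, assuming the file is available as a list of dictionaries and that the data producer has already chosen a set of quasi-identifiers. The function names, field names, and records are hypothetical. The check is run here on the records at hand, so it says nothing about the population unless the file is itself population data.

```python
from collections import Counter


def smallest_class_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same
    combination of quasi-identifier values (0 for an empty file)."""
    counts = Counter(
        tuple(rec[q] for q in quasi_identifiers) for rec in records
    )
    return min(counts.values()) if counts else 0


def satisfies_k_anonymity(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values that
    occurs in the file is shared by at least k records."""
    return smallest_class_size(records, quasi_identifiers) >= k


# Hypothetical records and quasi-identifiers, for illustration only.
records = [
    {"age_group": "30-39", "sex": "F", "state": "MD"},
    {"age_group": "30-39", "sex": "F", "state": "MD"},
    {"age_group": "30-39", "sex": "F", "state": "MD"},
    {"age_group": "40-49", "sex": "M", "state": "VA"},
    {"age_group": "40-49", "sex": "M", "state": "VA"},
]
quasi_identifiers = ["age_group", "sex", "state"]

print(smallest_class_size(records, quasi_identifiers))
print(satisfies_k_anonymity(records, quasi_identifiers, k=3))
```

In this toy file the smallest group has two records, so the file satisfies 2-anonymity but not 3-anonymity on the chosen variables.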

When public use files contain numerous variables or include continuous variables, sample uniqueness across the range of variables is almost assured. Under these circumstances a different approach to assessing disclosure risk is required. Commonly, this involves taking one or more alternative files that contain identifiers and attempting to match their records to records on the public use file. The accuracy of unique matches can be measured and, depending on the results, the data producer may decide to exclude high-risk records from the public use file or increase the level of masking on these records to prevent matches. If the accuracy of unique matches is sufficiently low, and there is no indication that correct matches can be differentiated from the vastly greater number of incorrect matches, the data producer may conclude that deleting or further masking the records that were matched correctly is not necessary. However, correct matches from publicly available data do provide direct evidence of vulnerability.
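
The matching exercise described above might be sketched as follows. This is a simplified illustration under stated assumptions: matching is exact on a chosen set of variables, a "unique match" means a public record has exactly one candidate in the external file, and the data producer keeps a true identifier (`true_id`, never released) alongside each public record so that matches can be scored. All names and records are hypothetical.

```python
from collections import defaultdict


def unique_match_accuracy(public_records, external_records, match_keys):
    """Count public records with exactly one external candidate on
    match_keys, and how many of those candidates are the right person.
    Returns (unique_matches, correct_matches, accuracy)."""
    # Index the external (identified) file by its matching variables.
    index = defaultdict(list)
    for ext in external_records:
        index[tuple(ext[k] for k in match_keys)].append(ext)

    unique = correct = 0
    for pub in public_records:
        candidates = index.get(tuple(pub[k] for k in match_keys), [])
        if len(candidates) == 1:
            unique += 1
            # true_id is held by the producer for evaluation only.
            if candidates[0]["person_id"] == pub["true_id"]:
                correct += 1

    accuracy = correct / unique if unique else 0.0
    return unique, correct, accuracy


# Hypothetical files, for illustration only.
public_file = [
    {"true_id": 1, "age": 34, "zip3": "208"},
    {"true_id": 2, "age": 51, "zip3": "221"},
    {"true_id": 3, "age": 47, "zip3": "100"},
]
external_file = [
    {"person_id": 1, "age": 34, "zip3": "208"},  # unique and correct
    {"person_id": 9, "age": 51, "zip3": "221"},  # unique but wrong person
    {"person_id": 3, "age": 47, "zip3": "100"},  # two candidates share
    {"person_id": 7, "age": 47, "zip3": "100"},  # these values: not unique
]

print(unique_match_accuracy(public_file, external_file, ["age", "zip3"]))
```

Here two public records match uniquely and one of those matches is correct, giving an accuracy of 0.5. The second record shows why a unique match is not necessarily a correct one: the true person is simply absent from the external file.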

The most rigorous way to assess disclosure risk is to attempt to identify records in the public use file from the source records in the original or internal file. The rigor in this approach comes from two factors. First, the only data divergence between the public use file and the internal file is that which was created deliberately to reduce disclosure risk. Second, unless the public use file was subsampled from the internal file, the two files will contain the same records, which is analogous to an intruder knowing with certainty who is included in the public use file. Because these factors can make re-identification rather easy, data producers using this approach must introduce some limitations on the match attempt to produce a realistic assessment of risk. Typically, this involves first determining what variables from the internal file might be available to a potential intruder, as it is likely that all or nearly all of the records in the public use file could be re-identified if all of the variables appearing in both files were used in the attempt. Data producers also need to account for the impact of sampling if the internal file is itself a sample from a larger population.
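
The effect of limiting the variables used in the match attempt can be illustrated with a small sketch. It assumes exact matching, a record identifier `id` that the producer retains on both files for evaluation only, and masking applied to a few public values. The function name and all data are hypothetical, and the sketch ignores the sampling adjustment mentioned above.

```python
from collections import Counter


def reidentification_rate(internal, public, variables):
    """Share of public records whose values on `variables` match
    exactly one internal record, where that record is the true source."""
    def key(rec):
        return tuple(rec[v] for v in variables)

    counts = Counter(key(rec) for rec in internal)
    # Only consulted for keys that occur exactly once in the internal file.
    source = {key(rec): rec["id"] for rec in internal}

    hits = 0
    for pub in public:
        k = key(pub)
        if counts[k] == 1 and source[k] == pub["id"]:
            hits += 1
    return hits / len(public) if public else 0.0


# Hypothetical internal file and its masked public counterpart.
internal_file = [
    {"id": 1, "sex": "F", "age": 34, "income": 52000},
    {"id": 2, "sex": "F", "age": 34, "income": 61000},
    {"id": 3, "sex": "M", "age": 58, "income": 48000},
    {"id": 4, "sex": "M", "age": 71, "income": 48000},
]
public_file = [
    {"id": 1, "sex": "F", "age": 34, "income": 50000},  # income rounded
    {"id": 2, "sex": "F", "age": 34, "income": 61000},
    {"id": 3, "sex": "M", "age": 58, "income": 48000},
    {"id": 4, "sex": "M", "age": 65, "income": 48000},  # age top-coded
]

# Compare variable sets an intruder might plausibly hold.
for variables in (["sex"], ["sex", "age"], ["sex", "age", "income"]):
    print(variables, reidentification_rate(internal_file, public_file, variables))
```

In this toy example the rate rises from 0.0 to 0.25 to 0.5 as more variables are assumed to be available, and the two masked records are never re-identified. Running the comparison over several plausible variable sets gives a range of risk estimates rather than a single worst-case figure.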
