A critical element in preparing a public use file of microdata is the assessment of disclosure risk, which may involve estimating the probability of re-identification. Often this is an iterative process in which a preliminary file is tested and, if the risk is determined to be too high, additional protective measures are applied. For the MASSC method, described above, risk assessment is incorporated into the disclosure limitation process.
When microdata protection is based on k-anonymity, the assessment of disclosure risk involves determining whether k-anonymity is satisfied. Ideally, this is done with population data, but strategies applicable to sample data exist, as noted above.
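The check itself can be sketched in a few lines. The example below is a minimal illustration, not a production tool: the function names, the record layout, and the choice of `age_group` and `region` as quasi-identifiers are all assumptions made for the example. It groups records by their quasi-identifier values and compares the smallest group size with k.

```python
from collections import Counter


def smallest_class_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    values on every quasi-identifier (0 if the file is empty)."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


def satisfies_k_anonymity(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values that occurs in
    the file is shared by at least k records."""
    return smallest_class_size(records, quasi_identifiers) >= k


# Hypothetical toy file; variable names and values are illustrative only.
records = [
    {"age_group": "30-39", "region": "North", "income": 41000},
    {"age_group": "30-39", "region": "North", "income": 52000},
    {"age_group": "40-49", "region": "South", "income": 38000},
    {"age_group": "40-49", "region": "South", "income": 61000},
    {"age_group": "40-49", "region": "South", "income": 47000},
]

quasi_identifiers = ["age_group", "region"]
print(smallest_class_size(records, quasi_identifiers))       # 2
print(satisfies_k_anonymity(records, quasi_identifiers, 2))  # True
print(satisfies_k_anonymity(records, quasi_identifiers, 3))  # False
```

Run on sample data, a check of this kind describes the sample only; a record that is unique in the sample need not be unique in the population.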
When public use files contain numerous variables or include continuous variables, sample uniqueness across the range of variables is almost assured. Under these circumstances a different approach to assessing disclosure risk is required. Commonly, this involves using one or more alternative files that contain identifiers and attempting to match their records to records on the public use file. NCES is able to exploit this strategy because there are publicly available lists of schools, and these lists include selected characteristics (Federal Committee on Statistical Methodology 2005). The accuracy of unique matches can be measured and, depending on the results, the data producer may decide to exclude high-risk records from the public use file or increase the level of masking on these records to prevent matches. If the accuracy of unique matches is sufficiently low, and there is no indication that correct matches can be differentiated from the vastly greater number of incorrect matches, the data producer may conclude that deleting or further masking the records that were matched correctly is not necessary. However, correct matches from publicly available data do provide direct evidence of vulnerability.
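A match test of this kind might be sketched as follows. This is an illustration under stated assumptions rather than a description of any agency's procedure: the function name, the field names (`id`, `true_id`, `state`, `enrollment_band`), and the toy records are hypothetical, and `true_id` stands for the identity that only the data producer knows and that would never appear on the released file. The sketch counts how many public use records match exactly one record on the external list, and how many of those unique matches are correct.

```python
from collections import defaultdict


def unique_match_accuracy(public_records, external_records, keys,
                          true_id="true_id", ext_id="id"):
    """Return (unique, correct): the number of public use records that match
    exactly one external record on `keys`, and how many of those unique
    matches point to the right unit."""
    # Index the external list by the combination of matching variables.
    index = defaultdict(list)
    for ext in external_records:
        index[tuple(ext[k] for k in keys)].append(ext[ext_id])

    unique = correct = 0
    for rec in public_records:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:          # a unique match
            unique += 1
            if candidates[0] == rec[true_id]:
                correct += 1              # ...that is also a correct match
    return unique, correct


# Hypothetical external list with identifiers (e.g., a public school list).
external = [
    {"id": "A", "state": "OH", "enrollment_band": "500-999"},
    {"id": "B", "state": "OH", "enrollment_band": "500-999"},
    {"id": "C", "state": "TX", "enrollment_band": "under 500"},
    {"id": "D", "state": "TX", "enrollment_band": "1000+"},
]

# Hypothetical public use records; true_id is held by the producer only.
public = [
    {"true_id": "A", "state": "OH", "enrollment_band": "500-999"},    # two candidates
    {"true_id": "C", "state": "TX", "enrollment_band": "under 500"},  # unique, correct
    {"true_id": "E", "state": "TX", "enrollment_band": "1000+"},      # unique, incorrect
]

unique, correct = unique_match_accuracy(public, external,
                                        ["state", "enrollment_band"])
print(unique, correct)  # 2 1
```

In this toy case half of the unique matches are correct. The producer's question is whether an intruder could tell the correct unique matches from the incorrect ones, which the counts alone do not answer.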
The most rigorous way to assess disclosure risk is to attempt to identify records in the public use file from the source records in the original or internal file. The rigor in this approach comes from two factors. First, the only data divergence between the public use file and the internal file is that which was created deliberately to reduce disclosure risk. Second, unless the public use file was subsampled from the internal file, the overlap in records between the two files is 100 percent, which is analogous to an intruder knowing with certainty who is included in the public use file. Because these factors can make re-identification rather easy, data producers who use this approach must introduce some limitations on the match attempt to produce a realistic assessment of risk. Typically, this involves first determining which variables from the internal file might be available to a potential intruder, as it is likely that all or nearly all of the records in the public use file could be re-identified if the full set of variables appearing in both files were used in the attempt. Data producers also need to account for the impact of sampling if the internal file is itself a sample of the full population. The reduction in disclosure risk if the public use file is a subsample of the internal file should be reflected in the results of the match attempt.
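The effect of restricting the match variables can be illustrated with a small sketch. Everything in it is assumed for the example: the function name, the `record_id` field (retained on the producer's test copy of the public use file only, never on the released file), the variables, and the toy data, in which one age value has been perturbed as a stand-in for masking. The function reports the share of public use records that link to exactly one internal record, and to the correct one, for a given set of match variables.

```python
from collections import Counter


def reidentification_rate(public_records, internal_records, keys,
                          id_field="record_id"):
    """Share of public use records whose values on `keys` occur exactly once
    in the internal file and belong to the same underlying record."""
    def key_of(record):
        return tuple(record[k] for k in keys)

    counts = Counter(key_of(r) for r in internal_records)
    id_by_key = {key_of(r): r[id_field] for r in internal_records}

    hits = sum(
        1 for r in public_records
        if counts[key_of(r)] == 1 and id_by_key[key_of(r)] == r[id_field]
    )
    return hits / len(public_records) if public_records else 0.0


# Hypothetical internal file.
internal = [
    {"record_id": 1, "sex": "F", "county": "X", "age": 34},
    {"record_id": 2, "sex": "F", "county": "X", "age": 35},
    {"record_id": 3, "sex": "M", "county": "X", "age": 34},
    {"record_id": 4, "sex": "M", "county": "Y", "age": 61},
]

# Hypothetical public use file: record 2's age was perturbed (35 -> 36).
public = [
    {"record_id": 1, "sex": "F", "county": "X", "age": 34},
    {"record_id": 2, "sex": "F", "county": "X", "age": 36},
    {"record_id": 3, "sex": "M", "county": "X", "age": 34},
    {"record_id": 4, "sex": "M", "county": "Y", "age": 61},
]

# Full set of shared variables versus sets an intruder might plausibly hold.
print(reidentification_rate(public, internal, ["sex", "county", "age"]))  # 0.75
print(reidentification_rate(public, internal, ["sex", "county"]))         # 0.5
print(reidentification_rate(public, internal, ["sex"]))                   # 0.0
```

The sketch covers only the choice of match variables. It makes no adjustment for sampling, so where the internal file is a sample of the population, or the public use file is a subsample of the internal file, the rates would need to be interpreted or adjusted accordingly.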