Minimizing Disclosure Risk in HHS Open Data Initiatives. 3. Record Linkage


When two files contain some of the same individuals, the records common to the two files can be linked if the two files also contain some of the same variables. When the two files contain unique and valid numeric identifiers, the records can be linked using “exact matching” on those fields—as is commonly done when files contain Social Security numbers. When the conditions for exact matching are absent but other, non-unique or imperfect identifiers are present, either “probabilistic record linkage” or distance-based matching can be used as an alternative.
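As a minimal sketch of exact matching, the following links two small files on a shared unique identifier. The records, the field names (`ssn`, `name`, `income`), and the helper `exact_match` are hypothetical and exist only for illustration; the identifier values are deliberately invalid.

```python
# Hypothetical records; field names and values are illustrative only.
file_a = [
    {"ssn": "000-00-0001", "name": "Ann Lee"},
    {"ssn": "000-00-0002", "name": "Bob Cruz"},
    {"ssn": "000-00-0003", "name": "Cy Dunn"},
]
file_b = [
    {"ssn": "000-00-0002", "income": 41000},
    {"ssn": "000-00-0003", "income": 58000},
    {"ssn": "000-00-0004", "income": 72000},
]


def exact_match(left, right, key):
    """Link records whose `key` values agree exactly.

    Assumes `key` is unique within `right`, as a valid identifier would be.
    """
    index = {rec[key]: rec for rec in right}
    return [{**rec, **index[rec[key]]} for rec in left if rec[key] in index]


linked = exact_match(file_a, file_b, "ssn")
for rec in linked:
    print(rec)
```

Only the individuals present in both files are linked; a single mistyped digit in either file would cause a true match to be missed, which is why the identifiers must be valid as well as unique.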

Probabilistic record linkage separates all possible combinations of records into likely matches, likely non-matches, and a group that cannot be confidently assigned to either and would require a manual “clerical” review to determine in which category they belong. Probabilistic record linkage is often applied to link records based on names and addresses, where duplicate names and spelling errors are possible and people may have been living at different addresses when the two files were created. The Census Bureau uses probabilistic record linkage to unduplicate the records collected in the decennial census, as some people may have been enumerated multiple times.
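The three-way classification described above can be sketched with agreement weights in the style of Fellegi and Sunter (1969). Each field contributes log2(m/u) when the two records agree on it and log2((1 - m)/(1 - u)) when they disagree, where m is the probability of agreement among true matches and u is the probability of agreement among non-matches. The m and u values, the two thresholds, and the field names below are assumptions chosen for illustration, not estimates from any real file.

```python
import math

# Hypothetical per-field probabilities (m, u):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PROBS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "zip": (0.80, 0.10),
}
# Illustrative thresholds; in practice they are set from error-rate targets.
UPPER, LOWER = 6.0, 0.0


def match_weight(rec_a, rec_b):
    """Sum the agreement or disagreement weight of every compared field."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(rec_a, rec_b):
    """Assign a record pair to one of the three groups."""
    weight = match_weight(rec_a, rec_b)
    if weight >= UPPER:
        return "match"
    if weight <= LOWER:
        return "non-match"
    return "clerical review"


a = {"last_name": "smith", "first_name": "john", "zip": "20201"}
moved = {"last_name": "smith", "first_name": "john", "zip": "21044"}
misspelled = {"last_name": "smith", "first_name": "jon", "zip": "21044"}
other = {"last_name": "jones", "first_name": "mary", "zip": "90210"}

for candidate in (moved, misspelled, other):
    print(classify(a, candidate))
```

With these assumed values, a pair that agrees on both names but not on ZIP code still scores as a match, which mirrors the case of a person who moved between the creation of the two files. A production system would typically compare names with an approximate string comparator rather than strict equality.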

Distance-based matching may be used instead of probabilistic record linkage when one or both files contain no explicit identifiers but the two files contain quantitative variables—such as income. Different distance functions may be used. The Euclidean distance is commonly used because of its simplicity and general effectiveness. It can also be used to match a single observation to a dataset, simulating an intruder who is trying to find a single, target individual. Torra et al. (2006) explored an alternative metric, the Mahalanobis distance, with notable success, although this approach is not nearly as straightforward to implement and is designed for matches between two datasets.
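The single-target case can be sketched as a nearest-neighbor search under the Euclidean distance. The released records, the target values, and the helper `nearest_record` are hypothetical. The sketch uses the raw values for brevity; when variables are measured on very different scales, they would normally be standardized first so that no single variable dominates the distance.

```python
import math

# Hypothetical released file with no explicit identifiers:
# each record is (age in years, income in thousands of dollars).
released = [(34, 52.0), (35, 88.5), (61, 47.2), (29, 51.0)]

# What a hypothetical intruder believes about one target individual.
target = (35, 90.0)


def nearest_record(target, records):
    """Return (index, distance) of the record closest to `target`."""
    distances = [math.dist(target, rec) for rec in records]
    best = min(range(len(records)), key=distances.__getitem__)
    return best, distances[best]


idx, dist = nearest_record(target, released)
print(idx, released[idx], dist)
```

The closest record is only a candidate link. Whether it is the correct one depends on how much the released values were perturbed and on how many other records lie at a similar distance.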

While record linkage is typically applied to variables that are common between two files, it is also possible to apply record linkage methods to files with no variables in common. One approach relies on correlations between variables (see, for example, Domingo-Ferrer and Torra 2003, which employs clustering methods). The effectiveness of such matching increases with the strength of the correlation between variables and the degree of overlap between the two files—that is, the percentage of records appearing in both files. An alternative approach applied by Torra (2000) uses a method called ordered weighted aggregation.

Record linkage techniques other than exact matching are computationally intensive because they entail “looking at” all possible pairs of records to determine a most likely match in one file for each record in the other file. When probabilistic record linkage in its present form was introduced (see Fellegi and Sunter 1969) and for many years afterwards, it was common to subset or “block” the files being linked to reduce the computational time. Only those potential links within the same block were evaluated. This precluded the possibility of identifying links across blocks. As processing capacity and computational efficiency have increased, this constraint has largely disappeared. For example, whereas the Census Bureau had to employ blocking when searching for duplicates in the decennial census file in earlier censuses, the bureau now routinely matches entire population files when performing record linkage for census unduplication and other purposes. [6] This capability is accessible to potential intruders as well and represents perhaps the most important dimension in which the risk of re-identification in public use files has increased.

[6] Blocking remains useful as a strategy for reducing the number of false matches. For example, while gender may be recorded incorrectly on occasion, allowing matches that disagree on gender may produce far too many false matches to justify the few additional true matches that it might detect. However, the use of blocking for the sole purpose of reducing computational time is becoming less and less common.
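The effect of blocking on the number of comparisons can be sketched as follows. The two files, the choice of ZIP code as the blocking variable, and the helper `candidate_pairs` are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import product

# Hypothetical files; `zip` serves as the blocking variable.
file_a = [
    {"id": "a1", "zip": "20201"},
    {"id": "a2", "zip": "20201"},
    {"id": "a3", "zip": "21044"},
]
file_b = [
    {"id": "b1", "zip": "20201"},
    {"id": "b2", "zip": "21044"},
    {"id": "b3", "zip": "21044"},
    {"id": "b4", "zip": "90210"},
]


def candidate_pairs(left, right, block_key=None):
    """List the record pairs that would be evaluated.

    Without a block key every pair is a candidate; with one, only pairs
    that share the same block value are kept.
    """
    if block_key is None:
        return list(product(left, right))
    blocks = defaultdict(list)
    for rec in right:
        blocks[rec[block_key]].append(rec)
    return [(a, b) for a in left for b in blocks.get(a[block_key], [])]


all_pairs = candidate_pairs(file_a, file_b)
blocked_pairs = candidate_pairs(file_a, file_b, block_key="zip")
print(len(all_pairs), len(blocked_pairs))
```

The unblocked comparison grows with the product of the two file sizes, while the blocked comparison grows only with the sum of the within-block products. The cost is that a pair such as `a1` and `b4`, which falls in different blocks, is never evaluated even if it is a true match.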
