Understanding the potential sources of disclosure risk requires an awareness of who might attempt to re-identify records in federal microdata, what their capabilities and resources are, what data they might use in their re-identification attempts, and what tools are available to assist them.
a. Potential Intruders
Terms such as intruder, adversary, attacker, and snooper have been applied to describe the individuals who might attempt to re-identify entities in public use data and apply that information in some, possibly malicious, way. We will use the term intruder.
Nearly all of the documented instances of records being re-identified in public use data have been accomplished by researchers, for the purpose of demonstrating data vulnerabilities. Researchers have incentives in the form of publication and possible career enhancement, in addition to contributing to a public good, namely better protection of public data in the future. Other potential intruders include hackers, persons with access to proprietary information, neighbors, family members, and former spouses. Protecting data against re-identification by family members is a particular concern when data were collected confidentially from other members of the family. The example discussed in Appendix E involves drug use by children. A different kind of threat is presented by former spouses, who may be able to realize a financial benefit by learning of a former partner’s finances and may have extensive information on which to base a re-identification.
b. Auxiliary Data
Following Dwork and Naor (2010), the data that a potential intruder would use to re-identify records on a public use file may be described as auxiliary data. The challenge in protecting a public use file from any possibility of re-identification is the inability to guarantee that there are no auxiliary data in anyone’s possession that would enable re-identification of even one record.
Voter registration lists, such as the one used by Sweeney in Massachusetts, are often cited as a key source of auxiliary data for potential intruders. Such data include identifiers along with demographic and residential characteristics that, in combination, uniquely identify many people. Benitez and Malin (2009) estimated the risk of re-identification from voter registration lists by state for datasets protected by the HIPAA Safe Harbor and Limited Dataset policies.15 Because voter registration lists vary in cost, the study addresses both the probability of successful re-identification, given the attempt, and cost factors that may affect the likelihood of an attempt. In their study, the Safe Harbor dataset included year of birth, gender, and race, while the Limited Dataset added county and date of birth. The results showed wide variation by state in estimated risk and in the unit price of each potential re-identification, with substantially greater risk under the Limited Dataset policy than under the Safe Harbor policy. They concluded that blanket protection policies expose different organizations (in different states) to differential disclosure risk.
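The role that the choice of fields plays in re-identification risk can be illustrated with a minimal sketch. It is not the method used by Benitez and Malin; it simply counts the share of records that are unique on a given combination of fields, and divides a list price by an expected number of re-identifications to obtain a rough unit cost. The toy population, the field names, and the function names `share_unique` and `cost_per_reidentification` are all hypothetical.

```python
import math
from collections import Counter

def share_unique(records, keys):
    """Fraction of records whose combination of `keys` values occurs exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

def cost_per_reidentification(list_price, expected_hits):
    """Rough unit cost: price of the auxiliary list per expected re-identification."""
    return math.inf if expected_hits == 0 else list_price / expected_hits

# Hypothetical toy population; every value is invented for illustration.
population = [
    {"birth_date": "1970-03-14", "gender": "F", "race": "W", "county": "A"},
    {"birth_date": "1970-09-02", "gender": "F", "race": "W", "county": "A"},
    {"birth_date": "1970-09-02", "gender": "F", "race": "W", "county": "B"},
    {"birth_date": "1985-01-20", "gender": "M", "race": "B", "county": "A"},
    {"birth_date": "1985-06-30", "gender": "M", "race": "B", "county": "B"},
    {"birth_date": "1985-06-30", "gender": "M", "race": "B", "county": "B"},
]
for record in population:
    record["birth_year"] = record["birth_date"][:4]

# Field sets loosely modeled on the two policies described in the text.
safe_harbor_like = ["birth_year", "gender", "race"]
limited_like = ["birth_date", "gender", "race", "county"]

print(share_unique(population, safe_harbor_like))  # no record is unique
print(share_unique(population, limited_like))      # four of six records are unique
```

In this toy population, adding county and full date of birth raises the share of unique records from none to two-thirds, which is the general pattern the study reports, although actual magnitudes depend on the population and the list used.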
Duncan et al. (2011) observe that “the Achilles’ heel for data stewardship organizations in dealing with the practicalities of risk assessment is their lack of knowledge about what data snoopers know.” While much is known or can readily be determined about the contents and coverage of certain public databases maintained by the states, the same cannot be said about the data that are compiled, maintained, and resold by commercial entities. A recent study by the U.S. Government Accountability Office (2013) concluded that “the advent of new and more advanced technologies and changes in the marketplace for consumer information have vastly increased the amount and nature of personal information collected and the number of parties that use or share this information.” The report provides examples of the types of information collected, some of which are discussed in Appendix E.
Although the data that are “out there” in the public domain or held by private sources may be considerable, the threat that they present is mitigated to at least some degree by discrepancies between the values recorded in these data sources and the values reported by survey respondents or collected by administrative agencies. Such “data divergence,” as described by Duncan et al. (2011), includes not only measurement error but also conceptual differences in the way that data elements are defined in different sources. Timing is another factor in data divergence. The data accessible to a would-be intruder and the information reported in federal datasets may be separated by years, which can matter a great deal for health conditions, income, and even geographic location.
c. Record Linkage
One of the most important tools available to the more sophisticated potential intruders is record linkage methodology. When two files contain some of the same individuals, the records common to the two files can be linked if the two files also contain some of the same variables. When the two files contain the same unique and valid numeric identifiers, the records can be linked using “exact matching” on those fields—as is commonly done when files contain Social Security numbers. When the conditions for exact matching are absent but other, non-unique or imperfect identifiers are present, either “probabilistic record linkage” or distance-based matching can be used as an alternative. Probabilistic record linkage separates all possible combinations of records into likely matches, likely non-matches, and a group that cannot be confidently assigned to either and would require a manual “clerical” review to determine in which category they belong. Probabilistic record linkage is often applied to link records based on names and addresses, where duplicate names and spelling errors are possible. Distance-based matching may be used instead of probabilistic record linkage when one or both files contain no explicit identifiers but the two files contain quantitative variables—such as income.
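The three-way classification used in probabilistic record linkage, and the idea behind distance-based matching, can be sketched as follows. The sketch assumes a standard likelihood-ratio weighting of field agreements; the agreement probabilities, the two thresholds, the scaling factors, and all field and function names are hypothetical values chosen for illustration, not parameters taken from any particular linkage system.

```python
import math

# Hypothetical per-field probabilities: (P(agree | true match), P(agree | non-match)).
# In practice these would be estimated from the files being linked.
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "zip": (0.90, 0.02),
}
UPPER, LOWER = 6.0, 0.0  # hypothetical decision thresholds on the total weight

def match_weight(rec_a, rec_b):
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total

def classify(weight):
    """Assign a record pair to one of the three groups described in the text."""
    if weight >= UPPER:
        return "match"
    if weight <= LOWER:
        return "non-match"
    return "clerical review"

def nearest_by_distance(target, candidates, scales):
    """Distance-based matching: the candidate closest to `target` on scaled numeric fields."""
    def distance(candidate):
        return math.sqrt(sum(((target[f] - candidate[f]) / s) ** 2
                             for f, s in scales.items()))
    return min(candidates, key=distance)

survey = {"surname": "smith", "birth_year": 1970, "zip": "02139"}
same = {"surname": "smith", "birth_year": 1970, "zip": "02139"}
misspelled = {"surname": "smyth", "birth_year": 1970, "zip": "02139"}
other = {"surname": "jones", "birth_year": 1988, "zip": "90210"}

for candidate in (same, misspelled, other):
    print(classify(match_weight(survey, candidate)))
```

With these assumed parameters, a pair that agrees on every field is classified as a match, a pair that differs only by a misspelled surname falls into the clerical review group, and a pair that disagrees on every field is classified as a non-match. The `nearest_by_distance` helper shows the alternative for files with quantitative variables only: each target record is paired with the candidate at the smallest scaled Euclidean distance.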
While record linkage is typically applied using variables that appear in both files being linked, it is also possible to use record linkage methods to link files that have no variables in common. For example, one approach relies on correlations between variables. The effectiveness of such matching increases with the strength of the correlation between variables in the two files and the degree of overlap between the two files—that is, the percentage of records in each file that appear in the other file.
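The text does not specify how such correlation-based linkage is carried out. One simple illustration, offered only as a sketch, pairs records by rank order when each file holds a different variable and the two variables are strongly and positively correlated. The sketch assumes the two files cover the same individuals and are of equal size; the function name `rank_link`, the variables, and the data are hypothetical.

```python
def rank_link(file_a, file_b, var_a, var_b):
    """Pair records by rank on two different but positively correlated variables.

    Assumes full overlap between the files; with partial overlap or a weak
    correlation, many of the resulting pairs would be wrong.
    """
    a_sorted = sorted(file_a, key=lambda r: r[var_a])
    b_sorted = sorted(file_b, key=lambda r: r[var_b])
    return [(a["id"], b["id"]) for a, b in zip(a_sorted, b_sorted)]

# Hypothetical files with no variable in common: one holds earnings,
# the other holds tax paid, which tends to rise with earnings.
file_a = [
    {"id": "a1", "earnings": 30000},
    {"id": "a2", "earnings": 85000},
    {"id": "a3", "earnings": 52000},
]
file_b = [
    {"id": "b1", "tax_paid": 9100},
    {"id": "b2", "tax_paid": 3200},
    {"id": "b3", "tax_paid": 19500},
]

links = rank_link(file_a, file_b, "earnings", "tax_paid")
print(links)
```

The sketch makes the two conditions named above concrete: the pairing is only as reliable as the correlation between the two variables, and it degrades as the share of records present in both files falls.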
15 Limited datasets are not intended to be released as public use files but through licensing or other restricted arrangements.