Following Dwork and Naor (2010), the data that a potential intruder would use to re-identify records on a public use file may be described as auxiliary data. The challenge in protecting a public use file from any possibility of re-identification is the inability to guarantee that there are no auxiliary data in anyone’s possession that would enable re-identification of even one record.5 To assess the level of risk, however, some understanding of the nature and amount of information that a potential intruder could access to re-identify individuals in a database is critical.
Purdam and Elliot (2002) conducted a review of public data sources in Europe to determine what data could potentially be used to re-identify records in data released by government agencies. No comparable assessment exists for the U.S., although numerous papers discuss various sources of data.
Benitez and Malin (2009) estimated the risk of re-identification from voter registration lists by state for datasets protected by the HIPAA Safe Harbor and Limited Dataset policies. Because voter registration lists vary in cost, the study addresses both the probability of successful re-identification, given an attempt, and cost factors that may affect the likelihood of an attempt. In their study, the Safe Harbor dataset included year of birth, gender, and race, while the Limited Dataset added county and date of birth. The results showed wide variation by state in estimated risk and in the unit price of each potential re-identification, with substantially greater risk under the Limited Dataset policy than under the Safe Harbor policy. They concluded that blanket protection policies expose organizations in different states to differential disclosure risk.
Barth-Jones (2012) cautions that voter registration lists may exclude a significant proportion of the population. In Cambridge, Massachusetts, when Governor Weld was re-identified, voter registration records covered about half of the adult population. The implication is that a set of characteristics that is unique in an area’s voter registration records may not be unique within the entire population. One cannot know that from just the voter records, however. To re-identify individuals within a small geographic area using simple demographic characteristics, one would need, in effect, a population register containing such characteristics for all individuals in the population.
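The coverage argument can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population is synthetic, the three quasi-identifiers (birth year, sex, and a small area code) and the function names are hypothetical, and the register is drawn as a simple random half of the population, which real voter lists are not. It counts the combinations that are unique in the partial register and checks how many of them are also unique in the full population.

```python
import random
from collections import Counter


def unique_combinations(records):
    """Return the set of quasi-identifier combinations that occur exactly once."""
    counts = Counter(records)
    return {combo for combo, n in counts.items() if n == 1}


def uniqueness_comparison(population, register):
    """Compare uniqueness in a partial register with uniqueness in the population.

    Both arguments are lists of tuples of quasi-identifiers. `register` is
    assumed to contain a subset of the people in `population`.
    Returns (number unique in the register, number of those also unique
    in the population).
    """
    register_unique = unique_combinations(register)
    population_unique = unique_combinations(population)
    # A register-unique combination identifies one person only if it is
    # also unique in the full population.
    truly_unique = register_unique & population_unique
    return len(register_unique), len(truly_unique)


# Hypothetical synthetic population: (birth year, sex, area code).
rng = random.Random(0)
population = [
    (rng.randint(1930, 1995), rng.choice("MF"), rng.randint(1, 5))
    for _ in range(2000)
]
# Assumption: the register covers a random half of the population.
register = rng.sample(population, k=len(population) // 2)

n_register_unique, n_truly_unique = uniqueness_comparison(population, register)
false_unique_share = (
    1 - n_truly_unique / n_register_unique if n_register_unique else 0.0
)
print(f"unique in register: {n_register_unique}")
print(f"also unique in population: {n_truly_unique}")
print(f"share of register uniques that are not population uniques: "
      f"{false_unique_share:.2f}")
```

The second count cannot be computed from the register alone, which is the point of the caution: an intruder holding only the voter list sees the first number and cannot tell which of those apparent uniques are shared with people missing from the list.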
Duncan et al. (2011) observe that “the Achilles’ heel for data stewardship organizations in dealing with the practicalities of risk assessment is their lack of knowledge about what data snoopers know.” While much is known or can readily be determined about the contents and coverage of certain public databases maintained by the states, the same cannot be said about the data that are compiled, maintained, and resold by commercial entities. A recent study by the U.S. Government Accountability Office (2013) concluded that “the advent of new and more advanced technologies and changes in the marketplace for consumer information have vastly increased the amount and nature of personal information collected and the number of parties that use or share this information.” The report provides examples of the types of information collected. For instance, in addition to individual and household demographic information, Acxiom collects household wealth indicators such as estimated ranges of household income, indicators of income-producing assets, and estimated ranges of net worth. More precise information includes the year, make, and model of household vehicles and household life event indicators. Experian’s data include a variety of physical ailments such as diabetes, high blood pressure, high cholesterol, and visual impairments; types of financial investments; and consumption tastes. The detailed contents are known by their internal users and, perhaps to a lesser degree, those who have bought data extracts, but this information—and, especially, the quality of the data—is not readily accessible to researchers seeking ways to protect federal data from re-identification. 
Some government agencies have purchased sets of data from these sources for their own research purposes: not only to learn how to better protect the confidentiality of their data from such sources, but also to obtain an additional data source that might be useful for nonresponse adjustment and imputation and, more generally, for reducing nonsampling error.
Although the data that are “out there” in the public domain or held by private sources may be considerable, the threat that they present is mitigated to at least some degree by discrepancies between the values recorded in these data sources and the values reported by survey respondents or collected by administrative agencies. Such “data divergence,” as described by Duncan et al. (2011), includes not only measurement error but also conceptual differences in the way that data elements are defined in different sources. For example, surveys that collect income data usually ask for gross earnings, but the earnings data collected by the Internal Revenue Service (some of which may end up in commercial databases by way of loan applications) are taxable earnings, which can be considerably less than gross earnings. Timing is another factor in data divergence. The data accessible to a would-be intruder and the information reported in federal datasets may be separated by years, which can matter a great deal for health conditions, income, and even geographic location.
5 Combining this understanding of auxiliary data with Dalenius’s assertion that access to a statistical database should not enable one to learn anything about an individual that could not be learned without access, Dwork and Smith (2009) explain the concept of differential privacy, which views disclosure risk in a different light. They argue that with the right auxiliary information, an intruder could learn something about an individual whether or not that individual was included in a particular database. Differential privacy compares the risk that an individual encounters by being included versus not included in that database.
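For readers who want the comparison in symbols, the standard formal statement of differential privacy can be written as follows. The notation here is an assumption introduced for illustration and is not taken from the sources cited above: $M$ is a randomized release mechanism, $D$ and $D'$ are any two databases that differ only in one individual's record, $S$ is any set of possible outputs, and $\varepsilon \ge 0$ is the privacy parameter.

```latex
\[
  \Pr\bigl[M(D) \in S\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[M(D') \in S\bigr]
  \qquad \text{for all neighboring } D, D' \text{ and all output sets } S .
\]
```

A small $\varepsilon$ means that the distribution of released results is nearly the same whether or not a given individual's record is included, which is the "included versus not included" comparison described in this note.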