Sweeney was able to re-identify a large proportion of the individuals in the Massachusetts hospital discharge data because the combination of ZIP code, sex, and date of birth in many cases pointed to unique individuals in the city of Cambridge voter registration records, which provided their names. In the language of the confidentiality literature, these characteristics defined “population uniques.” To protect the individuals in a dataset from re-identification, one must be certain that the characteristics reported on the file do not define unique individuals in a separate, identified database that is accessible to potential intruders.
In theory, the way to achieve this level of protection is to ensure that no combination of characteristics is shared by fewer than some minimum number of persons in the population. This concept is called “k-anonymity,” where k is the chosen minimum number (Sweeney 2002).9 It is a fundamental concept in protecting public use data from disclosure (Ciriani et al. 2007, El Emam and Dankar 2008). If the data producer has access to a population database containing the characteristics that will be reported on the public use file, the application of k-anonymity is straightforward and rigorous: characteristics that in combination define unique individuals can be altered so that, when combined, they point to no fewer than k people. Typically, however, the data producer is not able to access population data for this purpose and instead applies k-anonymity to the file that is to be released. This is a conservative approach: any combination shared by at least k records on the file is shared by at least k persons in the population, so it yields more protection than is necessary to achieve k-anonymity at the population level. However, when the files to which it is applied contain a non-trivial proportion of the population, it may not be excessively conservative.
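As a rough illustration of applying k-anonymity to the file itself, the following minimal sketch counts how many records share each combination of identifying characteristics and flags the combinations held by fewer than k records. The field names (`zip`, `sex`, `birth_year`), the toy records, and the function names are hypothetical, and truncating the ZIP code is only one of many possible ways to alter the data.

```python
from collections import Counter

def combination_counts(records, characteristics):
    """Count the records that share each combination of the given characteristics."""
    return Counter(tuple(r[c] for c in characteristics) for r in records)

def combinations_below_k(records, characteristics, k):
    """Return the combinations shared by fewer than k records, with their counts."""
    counts = combination_counts(records, characteristics)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical toy file; real files would have many more records and fields.
records = [
    {"zip": "02138", "sex": "F", "birth_year": 1945},
    {"zip": "02138", "sex": "F", "birth_year": 1945},
    {"zip": "02138", "sex": "M", "birth_year": 1960},
    {"zip": "02139", "sex": "M", "birth_year": 1960},
]
characteristics = ["zip", "sex", "birth_year"]

# Combinations that fail 2-anonymity on the file as it stands.
flagged = combinations_below_k(records, characteristics, k=2)
print(flagged)

# One possible alteration: report only the first three digits of the ZIP code.
coarsened = [{**r, "zip": r["zip"][:3]} for r in records]
print(combinations_below_k(coarsened, characteristics, k=2))
```

In this toy file the two male records are each unique until the ZIP code is coarsened, after which every combination is shared by two records.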
When datasets include only a representative sample of the population, the individuals whose combinations of characteristics are unique within the sample are called “sample uniques.” To apply k-anonymity in the least conservative way, the data producer would need to know when a sample unique is also a population unique. This is a challenging problem. For example, if a sample is selected with a probability of 1 in 1,000 (which is a relatively high sampling rate for a federal survey), the average respondent represents 1,000 people. If a respondent has a unique combination of the characteristics that would appear in, say, a voter registration database, that combination is about as likely to be shared by 2,000 people in the population as to belong to just one person. There is a literature on determining or inferring population uniqueness from sample uniqueness (see, for example, Skinner and Elliot 2002), but such formal techniques are not yet widely applied.
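To see why sample uniqueness is weak evidence of population uniqueness, the sketch below computes the chance that a combination held by a given number of people in the population turns up exactly once in the sample. It assumes each person is sampled independently with the same probability, which simplifies real survey designs, and the function name is illustrative. It does not estimate how likely a particular sample unique is to be a population unique; that would also require knowing how common each population count is.

```python
def prob_exactly_one_sampled(pop_count, sampling_rate):
    """Chance that exactly one of pop_count people is drawn into the sample.

    Binomial probability of one success in pop_count independent trials.
    """
    return pop_count * sampling_rate * (1 - sampling_rate) ** (pop_count - 1)

rate = 1 / 1000  # the 1-in-1,000 sampling rate used in the text

for pop_count in (1, 10, 1000, 2000):
    p = prob_exactly_one_sampled(pop_count, rate)
    print(f"{pop_count:>5} people share the combination: "
          f"P(exactly one in sample) = {p:.4f}")
```

Under these assumptions, a combination shared by many people can readily produce a single sampled record, so a sample unique cannot be presumed to be a population unique.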
9 Sweeney credits Pierangela Samarati with naming k-anonymity.