Public use microdata play a critical role in research and policy analysis. Exploratory research and many types of policy analysis do not lend themselves well to the conditions that govern restricted access as described above. The creation of public use data that protect the confidentiality of the subjects begins with de-identification, but depending on the contents of the data and the characteristics of the subjects, it may require the application of a number of additional techniques to reduce the risk of re-identification to a satisfactory level.
HIPAA codified a de-identification process for health records that includes the removal of 18 specific direct and indirect identifiers, which are listed in Appendix D. HIPAA requirements apply to a narrow range of datasets, but most of these identifiers have relevance outside of the health data that fall under the HIPAA regulations. The protections mandated by HIPAA go well beyond the simpler de-identification practices that were common in unregulated health data prior to HIPAA.
b. The Concept of k-Anonymity
To protect the individuals in a dataset from re-identification, one must be certain that the characteristics reported on the file do not define unique individuals in a separate, identified database that is accessible to potential intruders. In theory, the way to achieve this level of protection is to ensure that no combination of characteristics is shared by fewer than some minimum number of persons in the population. This concept is called “k-anonymity,” where k is the chosen minimum number (Sweeney 2002). This is a fundamental concept in protecting public use data from disclosure (Ciriani et al. 2007, El Emam and Dankar 2008).12
If the data producer has access to a population database containing characteristics that will be reported on the public use file, the application of k-anonymity as a principle of disclosure limitation is straightforward and rigorous. Characteristics that in combination define unique individuals can be altered so that, when combined, they point to no fewer than k people. Typically, however, the data producer is not able to access population data for this purpose and applies k-anonymity to the file that is to be released. This is a conservative approach in that it yields more protection than is necessary to achieve k-anonymity at the population level. However, when the files to which it is applied contain a non-trivial proportion of the population, it may not be excessively conservative.
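As a rough illustration of how k-anonymity might be checked on the file to be released, the following Python sketch counts how many records share each combination of characteristics and reports the smallest such count. The variable names (`age_group`, `sex`, `state`), the toy records, and the helper functions are hypothetical and are not drawn from any agency's practice; the choice of which characteristics to treat as identifying is itself an assumption the data producer would have to make.

```python
from collections import Counter

def class_sizes(records, quasi_identifiers):
    """Count the records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def k_of(records, quasi_identifiers):
    """Smallest class size: the largest k for which the file is k-anonymous."""
    sizes = class_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0

def violations(records, quasi_identifiers, k):
    """Records whose combination of characteristics is shared by fewer than k records."""
    sizes = class_sizes(records, quasi_identifiers)
    return [r for r in records
            if sizes[tuple(r[q] for q in quasi_identifiers)] < k]

# Hypothetical toy file; a real file would have many more records and variables.
records = [
    {"age_group": "30-39", "sex": "F", "state": "MD"},
    {"age_group": "30-39", "sex": "F", "state": "MD"},
    {"age_group": "30-39", "sex": "F", "state": "MD"},
    {"age_group": "40-49", "sex": "M", "state": "MD"},
    {"age_group": "40-49", "sex": "M", "state": "MD"},
    {"age_group": "80+",   "sex": "M", "state": "VA"},
]
QI = ["age_group", "sex", "state"]  # assumed quasi-identifiers

file_k = k_of(records, QI)              # smallest class size in the file
at_risk = violations(records, QI, k=2)  # records that are unique in the file
```

Because the counts are taken on the file rather than on the population, a record flagged here may in fact share its characteristics with many people who are not on the file, which is the sense in which the file-level check is conservative.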
c. Statistical Disclosure Limitation
More generally, statistical disclosure limitation encompasses a wide range of techniques for reducing detail, modifying data values, or creating alternative data values to minimize the likelihood of a successful re-identification and, secondarily, lessen the information that would be gained if a re-identification were actually accomplished. The techniques that federal agencies apply tend to vary across the agencies, and within an agency they are likely to vary with the dataset, as different datasets present different challenges, depending on the type of information collected, the depth of the information recorded, and the design of the sample. The next section provides an overview of the methods commonly employed for statistical disclosure limitation.
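To make "reducing detail" concrete, the sketch below shows one simple possibility: replacing exact age with ten-year bands and collapsing the oldest ages into a single open-ended category, then comparing the smallest group size before and after. The band width, the top-code value of 80, the toy records, and the function names are illustrative assumptions only, not a description of any particular agency's method.

```python
from collections import Counter

def min_class_size(records, keys):
    """Smallest number of records sharing one combination of values on `keys`."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values())

def coarsen_age(age, band_width=10, top_code=80):
    """Replace exact age with a band; ages at or above top_code share one category."""
    if age >= top_code:
        return f"{top_code}+"
    low = (age // band_width) * band_width
    return f"{low}-{low + band_width - 1}"

# Hypothetical records with exact ages (every age/sex combination is unique).
raw = [
    {"age": 31, "sex": "F"},
    {"age": 34, "sex": "F"},
    {"age": 38, "sex": "F"},
    {"age": 82, "sex": "M"},
    {"age": 85, "sex": "M"},
    {"age": 91, "sex": "M"},
]

# Release version: exact age is dropped in favor of the coarsened band.
released = [{"age_band": coarsen_age(r["age"]), "sex": r["sex"]} for r in raw]

k_before = min_class_size(raw, ["age", "sex"])
k_after = min_class_size(released, ["age_band", "sex"])
```

The gain in protection comes at the cost of analytic detail, since users of the released file can no longer distinguish ages within a band.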
12 The Office for Civil Rights (OCR) guidance on methods for de-identification of data under the HIPAA Privacy Rule discusses k-anonymity as a principle that can be applied to protect health data under HIPAA (OCR 2012).