Comments: Several commenters called for clarification of proposed language in the NPRM that would have permitted a covered entity to treat information as de-identified, even if specified identifiers were retained, as long as the probability of identifying subject individuals would be very low. Commenters expressed concern that the "very low" standard was vague and that covered entities would not have a clear and easy way to know when information meets this part of the standard.
Response: We agree with the comments that covered entities may need additional guidance on the types of analyses that they should perform in determining when the probability of re-identification of information is very low. We note that in the final rule, we reformulate the standard somewhat to require that a person with appropriate knowledge and experience apply generally accepted statistical and scientific methods relevant to the task to make a determination that the risk of re-identification is very small. In this context, we do not view the difference between a very low probability and a very small risk to be substantive. After consulting representatives of the federal agencies that routinely de-identify and anonymize information for public release,\16\ we attempt here to provide some guidance for the method of de-identification.
As requested by some commenters, we include in the final rule a requirement that covered entities (not following the safe harbor approach) apply generally accepted statistical and scientific principles and methods for rendering information not individually identifiable when determining if information is de-identified. Although such guidance will change over time to keep up with technology and the current availability of public information from other sources, as a starting point the Secretary approves the use of the following as guidance to such generally accepted statistical and scientific principles and methods:
(1) Statistical Policy Working Paper 22 - Report on Statistical Disclosure Limitation Methodology (http://www.fcsm.gov/working-papers/wp22.html) (prepared by the Subcommittee on Disclosure Limitation Methodology, Federal Committee on Statistical Methodology, Office of Management and Budget) and
(2) the Checklist on Disclosure Potential of Proposed Data Releases (http://www.fcsm.gov/docs/checklist_799.doc) (prepared by the Confidentiality and Data Access Committee, Federal Committee on Statistical Methodology, Office of Management and Budget).
We agree with commenters that such guidance will need to be updated over time and we will provide such guidance in the future.
According to Statistical Policy Working Paper 22, the two main sources of disclosure risk for de-identified records about individuals are the existence of records with unusual or unique characteristics (e.g., an unusual occupation or a very high salary or age) and the existence of external sources of records with matching data elements that can be used to link with the de-identified information and identify individuals (e.g., voter registration records or driver's license records). The risk of disclosure increases as the number of variables common to both types of records increases, as the accuracy or resolution of the data increases, and as the number of external sources increases. As outlined in Statistical Policy Working Paper 22, an expert disclosure analysis would also consider the probability that an individual who is the target of an attempt at re-identification is represented in both files, the probability that the matching variables are recorded identically on the two types of records, the probability that the target individual is unique in the population for the matching variables, and the degree of confidence that a match would correctly identify a unique person.
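The linkage risk described above can be illustrated with a brief sketch. This example is not part of the rule or of Working Paper 22; all records, field names, and quasi-identifiers below are hypothetical, chosen only to show how a record with an unusual characteristic can be matched uniquely against a named external file.

```python
# Hypothetical "de-identified" health records: direct identifiers removed,
# but several quasi-identifiers (3-digit ZIP, age band, occupation) retained.
deidentified = [
    {"zip3": "021", "age_band": "30-39", "occupation": "astronaut", "dx": "flu"},
    {"zip3": "021", "age_band": "30-39", "occupation": "teacher",   "dx": "asthma"},
]

# Hypothetical external file with names attached (e.g., a voter roll).
external = [
    {"name": "A. Smith", "zip3": "021", "age_band": "30-39", "occupation": "astronaut"},
    {"name": "B. Jones", "zip3": "021", "age_band": "30-39", "occupation": "teacher"},
    {"name": "C. Lee",   "zip3": "021", "age_band": "30-39", "occupation": "teacher"},
]

# Variables common to both files; risk grows as this list grows.
QUASI_IDS = ("zip3", "age_band", "occupation")

def candidate_matches(record, external_file):
    """Return external records agreeing with `record` on every quasi-identifier."""
    return [e for e in external_file
            if all(e[q] == record[q] for q in QUASI_IDS)]

for rec in deidentified:
    names = [m["name"] for m in candidate_matches(rec, external)]
    # Exactly one candidate means the diagnosis is linked to a named person.
    status = names[0] if len(names) == 1 else "ambiguous"
    print(rec["dx"], "->", status)
```

The unusual occupation yields exactly one candidate, so that diagnosis is disclosed; the common occupation yields two candidates and stays ambiguous. This is the sense in which rare characteristics and shared variables jointly drive re-identification risk.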
Statistical Policy Working Paper 22 also describes many techniques that can be used to reduce the risk of disclosure and that should be considered by an expert when de-identifying health information. In addition to removing all direct identifiers, these include the obvious choices based on the above sources of risk; namely, reducing the number of variables on which a match might be made, and limiting the distribution of the records through a "data use agreement" or "restricted access agreement" in which the recipient agrees to limits on who can use or receive the data. The techniques also include more sophisticated manipulations: recoding variables into fewer categories to provide less precise detail (including rounding of continuous variables); setting top-codes and bottom-codes to limit detail for extreme values; disturbing the data, such as by adding noise, swapping certain variables between records, replacing some variables in random records with mathematically imputed values or with averages across small random groups of records, or randomly deleting or duplicating a small sample of records; and replacing actual records with synthetic records that preserve certain statistical properties of the original data.
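Three of the manipulations named above (recoding into coarser categories, top-coding extreme values, and disturbing data with noise) can be sketched briefly. The functions, field names, and thresholds here are hypothetical illustrations, not a prescribed implementation.

```python
import random

def recode_age(age):
    """Recode an exact age into a ten-year band (less precise detail)."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def top_code(value, cap=150_000):
    """Top-code: collapse extreme values to an upper bound so outliers
    (e.g., a very high salary) cannot single out an individual."""
    return min(value, cap)

def add_noise(value, scale=500.0, rng=None):
    """Disturb a continuous value with small uniform random noise."""
    rng = rng or random.Random()
    return value + rng.uniform(-scale, scale)

# A hypothetical record with re-identifiable extremes.
record = {"age": 87, "salary": 420_000}
released = {
    "age_band": recode_age(record["age"]),   # "80-89"
    "salary": top_code(record["salary"]),    # 150000 (reported as "150,000 or more")
}
print(released)
```

Each technique trades statistical precision for a lower matching risk: coarser bands and capped extremes shrink the set of variables and values an external file could match on, at the cost of analytic detail.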