Privacy and Health Research. Anonymized Data

05/01/1997

A great deal of useful health research is performed on completely anonymized data. If a research project has no compelling reason to retain at least potential identifiability, anonymized data should be used. Though this injunction might sound unnecessary, it is stated here because data with identifiers are often used simply because they already happen to be on hand in identified form.

Data may be non-identifiable if any of the following tactics have been employed:

  • Identifiers simply have never been collected.
  • Identifiers have been removed ("stripped") effectively.
  • Data have been aggregated: within each data sub-element the values have been averaged or grouped into ranges, and only the averages or ranges are reported, so that the identity of the data-subjects is not revealed.
  • Data have been "micro-aggregated," with small randomly assembled clusters of cases averaged, in effect generating a set of pseudo-cases that represent the real population. (62)
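The micro-aggregation tactic in the last item can be illustrated with a minimal sketch. It assumes that each case is a tuple of numeric values; the function name, the record layout, and the cluster size of three are hypothetical choices made for illustration, not details taken from the cited method.

```python
import random
from statistics import mean


def micro_aggregate(records, cluster_size=3, seed=None):
    """Average small, randomly assembled clusters of numeric cases.

    records: list of equal-length tuples of numbers (hypothetical layout).
    Returns one pseudo-case (a tuple of column means) per cluster.
    """
    if cluster_size < 2:
        raise ValueError("cluster_size must be at least 2")
    if len(records) < cluster_size:
        raise ValueError("too few records to form a single cluster")

    # Shuffle a copy so that clusters are assembled at random.
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    # Cut the shuffled cases into consecutive clusters.
    clusters = [
        shuffled[i:i + cluster_size]
        for i in range(0, len(shuffled), cluster_size)
    ]

    # Fold an undersized final cluster into the previous one, so that
    # no pseudo-case is built from fewer than cluster_size real cases.
    if len(clusters) > 1 and len(clusters[-1]) < cluster_size:
        tail = clusters.pop()
        clusters[-1].extend(tail)

    # Each pseudo-case is the column-wise mean of its cluster.
    return [tuple(mean(col) for col in zip(*cluster)) for cluster in clusters]
```

Whether clusters of this size give enough protection depends on the sample and on the other information available to a would-be identifier; the sketch shows only the mechanics.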

The test of whether data actually are non-identifiable is whether a person without prior knowledge of the data or their collection can, from the data and any other available information (such as postal-code charts, or a casually-held key to a code, or a list of the people recruited to the study), deduce the personal identity of data-subjects.
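One narrow, mechanical part of this test can be sketched in code: counting how many records share each combination of indirectly identifying values. The field names and helper functions below are hypothetical, and the check is only a partial proxy. It cannot account for outside information such as a casually held key or a recruitment list, so a favorable result does not by itself show that the data are non-identifiable.

```python
from collections import Counter


def _key(record, indirect_fields):
    # Combination of indirectly identifying values for one record.
    return tuple(record[f] for f in indirect_fields)


def smallest_group(records, indirect_fields):
    """Size of the smallest group of records sharing the same combination
    of values on indirect_fields (0 if there are no records)."""
    counts = Counter(_key(r, indirect_fields) for r in records)
    return min(counts.values()) if counts else 0


def singled_out(records, indirect_fields):
    """Records whose combination of values is shared with no other record."""
    counts = Counter(_key(r, indirect_fields) for r in records)
    return [r for r in records if counts[_key(r, indirect_fields)] == 1]
```

A record returned by `singled_out` is one that anyone who knows those few characteristics of a person could pick out of the data set.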

In an area in which the issue is highly contentious, a "consensus statement" from a workshop on genetic research on stored human tissue samples stated emphatically (63):

Samples are anonymous if and only if it is impossible under any circumstances to identify the individual source. At present, in settings such as those involving large population groups, it may be possible to ensure anonymity while retaining some information about the individual source, such as ethnic origin, sex, age cohort, or limited clinical data, with the sample. In other settings, such as DNA samples obtained from a small group of individuals at risk for a specific disorder, retention of additional information may compromise anonymity. Samples are not anonymous if it is possible for any person to link the sample with its source. Even if the researcher cannot identify the source of the tissue, the samples are not anonymous if some other individual or institution has the ability.

If data must be transformed before being released for research, whether into irreversibly anonymized or into key-coded form, characteristics that might indirectly lead to identification of the data-subject should be obscured, blurred, or masked. Residential addresses can be translated into regions. Since some postal zones may be sparsely populated or have a distinctive cast of inhabitants, postal-code identifiers might be avoided. Instead of birthdate, age or age brackets can be used. Instead of the exact number of beds in nursing homes, capacity categories can be used. Personal initials are themselves personal identifiers and should be treated as such.
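The transformations described above might be sketched as follows. Everything specific here is an assumption made for illustration: the record fields, the postal-code-to-region lookup, the ten-year age brackets, and the capacity cut points of 50 and 150 beds. The released record is built only from coarsened values, so names and initials are simply never copied across.

```python
from datetime import date


def age_bracket(birthdate, on_date, width=10):
    """Replace a birthdate with an age bracket such as '40-49'."""
    # Subtract one year if the birthday has not yet occurred in on_date's year.
    not_yet = (on_date.month, on_date.day) < (birthdate.month, birthdate.day)
    age = on_date.year - birthdate.year - not_yet
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def capacity_category(beds):
    """Replace an exact bed count with a category (hypothetical cut points)."""
    if beds < 50:
        return "small (under 50)"
    if beds < 150:
        return "medium (50-149)"
    return "large (150 or more)"


def generalize(record, region_of, on_date):
    """Build a released record from coarsened values only.

    region_of: hypothetical mapping from postal code to a broad region.
    """
    return {
        "region": region_of.get(record["postal_code"], "other"),
        "age_bracket": age_bracket(record["birthdate"], on_date),
        "capacity": capacity_category(record["beds"]),
    }
```

How coarse the regions, brackets, and categories need to be is a judgment that depends on the sample and the population from which it is drawn.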

The extent to which any transformations are employed should be scaled to the characteristics of the sample and the population of which it is a subset, the potential risks to the data-subjects, the subjects' expectations, and other factors.

Many technical methods of "disclosure limitation" can be applied to make deductive identification of data-subjects difficult, if not impossible. In population studies, for instance, only relatively small proportions of the populations can be sampled. For surveys, only a randomly selected subset of the responses might be released instead of all of the responses, to obviate guessing, by elimination, who said what. And so on. (64)
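The survey tactic of releasing only a random subset of responses can be sketched in a few lines. The function name and the release fraction are hypothetical; the appropriate fraction in practice would depend on the size and character of the survey.

```python
import random


def release_subset(responses, fraction, seed=None):
    """Return a random subset of the responses, without replacement.

    fraction: share of responses to release, in the interval (0, 1].
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    responses = list(responses)
    if not responses:
        return []
    # Release at least one response, rounding to the nearest whole count.
    k = max(1, round(len(responses) * fraction))
    rng = random.Random(seed)
    return rng.sample(responses, k)
```

Because an outsider cannot tell which respondents fall in the released subset, knowing who took part in the survey no longer allows identification by elimination.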


(62) Alexander M. Walker, "Generic data," Pharmacoepidemiology and Drug Safety 4, 265–267 (1995).

(63) Page 1787 of Ellen Wright Clayton, Karen K. Steinberg, Muin J. Khoury, Elizabeth Thomson, Lori Andrews, Mary Jo Ellis Kahn, Loretta M. Kopelman, and Joan O. Weiss, "Informed consent for genetic research on stored tissue samples," Journal of the American Medical Association 274, 1786–1792 (1995).

(64) National Research Council, Panel on Confidentiality and Access of the Committee on National Statistics, and the Social Sciences Research Council; George T. Duncan, Thomas B. Jabine, and Virginia de Wolf, editors, Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics (National Academy Press, Washington, DC, 1993).