Privacy and Health Research. Managing Key-Coded Data


Because irreversible anonymization is often undesirable on scientific grounds, the procedures and methods of key-coding in its various forms are essential techniques. Some of the practices are very technical. The degree of key-coding, or "masking," is relative: it is a question of the extent to which personal identifiability is obscured, which is to say, how strongly the coding resists attempts to "crack" the code and match the data with the data-subjects.

U.S. agencies, such as the National Heart, Lung, and Blood Institute (NHLBI), emphasize that the first step in protecting personally identifiable data is simply to hold the identifiers close to the point of collection. Before data are transferred to other researchers, then, they should be stripped of identifiers and either key-coded or anonymized. When the Institute sends data from clinical trials of an investigational new drug to pharmaceutical companies, it strips off not only the patient and physician names but also location, birthdate, and other data that could point back to the data-subject. It takes similar care when it correlates data from several sources, as when it links heart disease data with socioeconomic data.
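
The stripping step described above can be pictured with a minimal sketch. The field names and the sample record are hypothetical, invented only for illustration; they are not drawn from any NHLBI procedure, and a real study would draw up its own list of identifying fields.

```python
# Hypothetical field names chosen for illustration; a real study would
# define its own list of identifying fields.
IDENTIFYING_FIELDS = frozenset(
    {"patient_name", "physician_name", "location", "birthdate"}
)


def strip_identifiers(record, identifying_fields=IDENTIFYING_FIELDS):
    """Return a copy of `record` without the identifying fields."""
    return {
        field: value
        for field, value in record.items()
        if field not in identifying_fields
    }


# Invented example record, not real data.
trial_record = {
    "patient_name": "Jane Doe",
    "physician_name": "Dr. A. Smith",
    "location": "Springfield",
    "birthdate": "1950-03-14",
    "treatment_arm": "B",
    "systolic_bp": 142,
}

stripped_record = strip_identifiers(trial_record)
```

Removing a fixed list of fields is only a first step. Whether the remaining data could still point back to a data-subject, for example through an unusual combination of values, has to be judged for each data set.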

Simply designating a reliable person within the research organization to be responsible for stripping identifiers—and formally certifying to the principal investigator and/or an administrator that the resulting set of stripped data is nonidentifiable—can be prudent.

Trusted intermediary organizations, such as public accounting or consulting firms, may be asked to remove identifiers, and perhaps to hold the key linking data with identifiers. For a detailed national analysis of hospital costs based on data provided by the States, the U.S. Agency for Health Care Policy and Research arranged for an intermediary organization to remove identifying information from the patient data, and also information that might identify the hospitals, before the Agency received the data.

In its alcohol-related studies, which may be painfully sensitive for the people studied, the U.S. National Institute on Alcohol Abuse and Alcoholism assigns pseudonymous (key-coded) identifiers to all subjects and has the key held securely by an independent third party.
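
As a rough illustration of key-coding with a separately held key, the sketch below replaces each subject identifier with a random pseudonym and returns the linking key as a separate table, which is what would be handed to the third party. The function name, field names, and sample records are assumptions made for this example; they do not describe the Institute's actual procedure.

```python
import secrets


def key_code_subjects(records, id_field="subject_id"):
    """Replace `id_field` with a random pseudonym in every record.

    Returns (coded_records, key_table). `key_table` maps each pseudonym
    back to the original identifier and is meant to be stored apart from
    the coded records, for example with an independent third party.
    """
    pseudonym_for = {}  # original identifier -> pseudonym
    key_table = {}      # pseudonym -> original identifier
    coded_records = []
    for record in records:
        original_id = record[id_field]
        if original_id not in pseudonym_for:
            pseudonym = secrets.token_hex(8)
            while pseudonym in key_table:  # guard against a rare collision
                pseudonym = secrets.token_hex(8)
            pseudonym_for[original_id] = pseudonym
            key_table[pseudonym] = original_id
        coded = {k: v for k, v in record.items() if k != id_field}
        coded["pseudonym"] = pseudonym_for[original_id]
        coded_records.append(coded)
    return coded_records, key_table


# Invented example records, not real data.
interviews = [
    {"subject_id": "S-001", "drinks_per_week": 12},
    {"subject_id": "S-002", "drinks_per_week": 3},
    {"subject_id": "S-001", "drinks_per_week": 9},
]

coded_interviews, linking_key = key_code_subjects(interviews)
```

Because the pseudonyms are random rather than derived from the identifier, the coded records cannot be matched to subjects without the key table, and repeated records for the same subject still carry the same pseudonym.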

The U.S. National Institute of Child Health and Human Development (NICHD) requires that if researchers wish to perform a secondary study on data originally collected by other investigators under an NICHD grant, they must pay a fee to the original researchers to key-code the identifiers and take other protective steps before transferring the data for the secondary study.

The following example illustrates a rigorous approach to separating identifiers from data but retaining the ability to reconnect them if necessary. In several states of Germany an elaborate system is being tested for population-based cancer registries (65). A "trusted office" (Vertrauensstelle), directed by a physician, receives cancer case data from doctors and hospitals, classifies the cases as to type of tumor and so on, and, using cryptographic procedures, assigns pseudonyms, separating the case data from the person-identifying data. Then, using a secure system, it transfers the pseudonymized data to a separately located "registration office" (Registerstelle), which stores the data securely. After a short time the "trusted office" destroys its set of the data. Again separately, a master re-identification key is held by a "supervisory office." The "registration office" cannot match identifiers to the cases it stores. If, later, it becomes scientifically necessary to trace back to the patient's physician to obtain more information, with the approval of an ethics committee the "supervisory office" can use its re-identification key to reassociate the case data with the identifying data. The system has been endorsed in the relevant laws. Whether such a system will be widely applicable is not yet clear; but such approaches deserve to be evaluated.
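
The division of roles among the three offices can be sketched in simplified form. This is not the cryptographic procedure of the cited system: as a stand-in, the sketch derives pseudonyms with a keyed hash (HMAC-SHA256), and it models the supervisory office's re-identification key as a plain lookup table. All class names, method names, and sample values are hypothetical.

```python
import hashlib
import hmac


class SupervisoryOffice:
    """Holds the re-identification information, released only on approval."""

    def __init__(self):
        self._identity_for = {}  # pseudonym -> identifying data

    def deposit(self, pseudonym, identity):
        self._identity_for[pseudonym] = dict(identity)

    def reidentify(self, pseudonym, ethics_approval):
        if not ethics_approval:
            raise PermissionError("ethics committee approval is required")
        return self._identity_for[pseudonym]


class RegistrationOffice:
    """Stores pseudonymized case data; never sees identifying data."""

    def __init__(self):
        self.cases = []

    def store(self, pseudonym, case_data):
        self.cases.append({"pseudonym": pseudonym, **case_data})


class TrustedOffice:
    """Receives reports, assigns pseudonyms, and keeps no copy afterwards."""

    def __init__(self, secret_key, registration_office, supervisory_office):
        self._secret_key = secret_key  # bytes known only to this office
        self._registration = registration_office
        self._supervisory = supervisory_office

    def _pseudonym(self, identity):
        # Normalize the identifying fields so that the same patient,
        # reported twice, receives the same pseudonym.
        message = "|".join(
            str(identity[field]).strip().lower() for field in sorted(identity)
        ).encode("utf-8")
        return hmac.new(self._secret_key, message, hashlib.sha256).hexdigest()

    def submit(self, identity, case_data):
        pseudonym = self._pseudonym(identity)
        self._registration.store(pseudonym, case_data)
        self._supervisory.deposit(pseudonym, identity)
        return pseudonym  # nothing else is retained by this office


# Invented example values, not real data.
supervisory = SupervisoryOffice()
registration = RegistrationOffice()
trusted = TrustedOffice(b"example-secret-key", registration, supervisory)

patient = {"name": "Erika Mustermann", "birthdate": "1948-07-02"}
first = trusted.submit(patient, {"tumor_type": "C50", "year": 1995})
second = trusted.submit(patient, {"tumor_type": "C50", "year": 1996})
```

A keyed hash lets two reports about the same patient be linked under one pseudonym without the registration office learning who the patient is. Re-identification in this sketch goes only through the supervisory office, and only when approval is given.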

(65) K. Pommerening, M. Miller, I. Schmidtmann, and J. Michaelis, "Pseudonyms for cancer registries," Methods of Information in Medicine 35, 112–121 (1996).