Comments: Many commenters argued that stripping all 19 identifiers is unnecessary for purposes of de-identification. They felt that such items as zip code, city (or county), and birth date, for example, do not identify the individual and only such identifiers as name, street address, phone numbers, fax numbers, email, Social Security number, driver's license number, voter registration number, motor vehicle registration, identifiable photographs, finger prints, voice prints, web universal resource locator, and Internet protocol address number need to be removed to reasonably believe that data has been de-identified.
Other commenters felt that removing the full list of identifiers would significantly reduce the usefulness of the data. Many of these comments focused on research and, to a lesser extent, marketing and undefined "statistical analysis." Commenters who represented various industries and research institutions expressed concern that they would not be able to continue current activities such as development of service provider networks, conducting "analysis" on behalf of the plan, studying use of medication and medical devices, community studies, marketing and strategic planning, childhood immunization initiatives, patient satisfaction surveys, and solicitation of contributions. The requirements in the NPRM to strip off zip code and date of birth were of particular concern. These commenters stated that their ability to do research and quality analysis with this data would be compromised without access to some level of information about patient age and/or geographic location.
Response: While we understand that removing the specified identifiers may reduce the usefulness of the resulting data to third parties, we remain convinced by the evidence found in the MIT study that we referred to in the preamble to the proposed rule 17 and the analyses discussed below that there remains a significant risk of identification of the subjects of health information from the inclusion of indirect identifiers such as birth date and zip code and that in many cases there will be a reasonable basis to believe that such information remains identifiable. We note that a covered entity not relying on the safe harbor may determine that information from which sufficient other identifiers have been removed but which retains birth date or zip code is not reasonably identifiable. As discussed above, such a determination must be made by a person with appropriate knowledge and expertise applying generally accepted statistical and scientific methods for rendering information not identifiable.
Although we have determined that all of the specified identifiers must be removed before a covered entity meets the safe harbor requirements, we made modifications in the final rule to the specified identifiers on the list to permit some information about age and geographic area to be retained in de-identified information.
For age, we specify that, in most cases, year of birth may be retained, which can be combined with the age of the subject to provide sufficient information about age for most uses. After considering current and evolving practices and consulting with federal experts on this topic, including members of the Confidentiality and Data Access Committee of the Federal Committee on Statistical Methodology, Office of Management and Budget, we concluded that in general, age is sufficiently broad to be allowed in de-identified information, although all dates that might be directly related to the subject of the information must be removed or aggregated to the level of year to prevent deduction of birth dates. Extreme ages -- 90 and over -- must be aggregated further (to a category of 90+, for example) to avoid identification of very old individuals (because they are relatively rare). This reflects the minimum requirement of the current recommendations of the Bureau of the Census. 18 For research or other studies relating to young children or infants, we note that the rule would not prohibit age of an individual from being expressed as an age in months, days, or hours.
For geographic area, we specify that the initial three digits of zip codes may be retained for any three-digit zip code that contains more than 20,000 people as determined by the Bureau of the Census. As discussed more below, there are currently only 18 three-digit zip codes containing fewer than 20,000 people. We note that this number may change when information from the 2000 Decennial Census is analyzed.
In response to concerns expressed in the comments about the need for information on geographic area, we investigated the potential of allowing 5-digit zip codes or 3-digit zip codes to remain in the de-identified information. According to 1990 Census data, the populations in geographical areas delineated by 3-digit zip codes vary a great deal, from a low of 394 to a high of 3,006,997, with an average size of 282,304. There are two 3-digit zip codes containing fewer than 500 people and six 3-digit zip codes containing fewer than 10,000 people each. 19 Of the total of 881 3-digit zip codes, there are 18 with fewer than 20,000 people, 71 with fewer than 50,000 people, and 215 containing fewer than 100,000 population. We also looked at two-digit zip codes (the first 2 digits of the 5-digit zip code) and found that the smallest of the 98 2-digit zip codes contains 188,638 people.
We also investigated the practices of several other federal agencies which are mandated by Congress to release data from national surveys while preserving confidentiality and which have been dealing with these issues for decades. The problems and solutions being used by these agencies are laid out in detail in the Statistical Policy Working Paper 22 cited earlier.
To protect the privacy of individuals providing information to the Bureau of Census, the Bureau has determined that a geographical region must contain at least 100,000 people.20 This standard has been used by the Bureau of the Census for many years and is supported by simulation studies using Census data. 21 These studies showed that after a certain point, increasing the size of a geographic area does not significantly decrease the percentage of unique records (i.e., those that could be identified if sampled), but that the point of diminishing returns is dependent on the number and type of demographic variables on which matching might occur. For a small number of demographic variables (6), this point was quite low (about 20,000 population), but it rose quickly to about 50,000 for 10 variables and to about 80,000 for 15 variables. The Bureau of the Census releases sets of data to the public that it considers safe from re-identification because it limits geographical areas to those containing at least 100,000 people and limits the number and detail of the demographic variables in the data. At the point of approximately 100,000 population, 7.3% of records were unique (and therefore potentially identifiable) on 6 demographic variables from the 1990 Census Short Form: age in years (90 categories), race (up to 180 categories), sex (2 categories), relationship to householder (14 categories), Hispanic (2 categories), and tenure (owner vs. renter in 5 categories). Using 6 variables derived from the Long Form data, age (10 categories), race (6 categories), sex (2 categories), marital status (5 categories), occupation (54 categories), and personal income (10 categories), raised the percentage to 9.8%.
We also examined the results of an NCHS simulation study using national survey data 22 to see if some scientific support could be found for a compromise. The study took random samples from populations of different sizes and then compared the samples to the whole population to see how many records were identifiable, that is, matched uniquely to a unique person in the whole population on the basis of 9 demographic variables: age (85 categories), race (4 categories), gender (2 categories), ethnicity (2 categories), marital status (3 categories), income (3 categories), employment status (2 categories), working class (4 categories), and occupation (42 categories). Even when some of the variables are aggregated or coded, from the perspective of a large statistical agency desiring to release data to the public, the study concluded that a population size of 500,000 was not sufficient to provide a reasonable guarantee that certain individuals could not be identified. About 2.5 % of the sample from the population of 500,000 was uniquely identifiable, regardless of sample size. This percentage rose as the size of the population decreased, to about 14% for a population of 100,000 and to about 25% for a population of 25,000. Eliminating the occupation variable (which is less likely to be found in health data) reduced this percentage significantly to about 0.4 %, 3%, and 10% respectively. These percentages of unique records (and thus the potentials for re-identification) are highly dependent on the number of variables (which must also be available in other databases which are identified to be considered in a disclosure risk analysis), the categorical breakdowns of those variables, and the level of geographic detail included.
With respect to how we might clarify the requirement to achieve a "low probability" that information could be identified, the Statistical Policy Working Paper 22 referenced above discusses the attempts of several researchers to define mathematical measures of disclosure risk only to conclude that "more research into defining a computable measure of risk is necessary." When we considered whether we could specify a maximum level of risk of disclosure with some precision (such as a probability or risk of identification of < 0.01), we concluded that it is premature to assign mathematical precision to the "art" of de-identification.
After evaluating current practices and recognizing the expressed need for some geographic indicators in otherwise de-identified databases, we concluded that permitting geographic identifiers that define populations of greater than 20,000 individuals is an appropriate standard that balances privacy interests against desirable uses of de-identified data. In making this determination, we focused on the studies by the Bureau of Census cited above which seemed to indicate that a population size of 20,000 was an appropriate cut off if there were relatively few (6) demographic variables in the database. Our belief is that, after removing the required identifiers to meet the safe harbor standards, the number of demographic variables retained in the databases will be relatively small, so that it is appropriate to accept a relatively low number as a minimum geographic size.
In applying this provision, covered entities must replace the (currently 18) forbidden 3-digit zip codes with zeros and thus treat them as a single geographic area (with > 20,000 population). The list of the forbidden 3-digit zip codes will be maintained as part of the updated Secretarial guidance referred to above. Currently, they are: 022, 036, 059, 102, 203, 555, 556, 692, 821, 823, 830, 831, 878, 879, 884, 893, 987, and 994. This will result in an average 3-digit zip code area population of 287,858 which should result in an average of about 4% unique records using the 6 variables described above from the Census Short Form. Although this level of unique records will be much higher in the smaller geographic areas, the actual risk of identification will be much lower because of the limited availability of comparable data in publically available, identified databases, and will be further reduced by the low probability that someone will expend the resources to try to identify records when the chance of success is so small and uncertain. We think this compromise will meet the current need for an easy method to identify geographic area while providing adequate protection from re-identification. If a greater level of geographical detail is required for a particular use, the information will have to be obtained through another permitted mechanism or be subjected to a specific de-identification determination as described above. We will monitor the availability of identified public data and the concomitant re-identification risks, both theoretical and actual, and adjust this safe harbor in the future as necessary.
As we stated above, we understand that many commenters would prefer a looser standard for determining when information is de-identified, both generally and with respect to the standards for identifying geographic area. However, because public databases (such as voter records or driver's license records) that include demographic information about a geographically defined population are available, a surprisingly large percentage of records of health information that contain similar demographic information can be identified. Although the number of these databases seems to be increasing, the number of demographic variables within them still appears to be fairly limited. The number of cases of privacy violation from health records which have been identified in this way is small to date. However, the risk of identification increases with decreasing population size, with increasing amounts of demographic information (both in level of detail and number of variables), and with the uniqueness of the combination of such information in the population. That is, an 18 year old single white male student is not at risk of identification in a database from a large city such as New York. However, if the database were about a small town where most of the inhabitants were older, retired people of a specific minority race or ethnic group, that same person might be unique in that community and easily identified. We believe that the policy that we have articulated reaches the appropriate balance between reasonably protecting privacy and providing a sufficient level of information to make de-identified databases useful.
Comments: Some comments noted that identifiers that accompany photographic images are often needed to interpret the image and that it would be difficult to use the image alone to identify the individual.
Response: We agree that our proposed requirement to remove all photographic images was more than necessary. Many photographs of lesions, for example, which cannot usually be used alone to identify an individual, are included in health records . In this final rule, the only absolute requirement is the removal of full-face photographs, and we depend on the "catch-all" of "any other unique ... characteristic ..." to pick up the unusual case where another type of photographic image might be used to identify an individual.
Comments: A number of commenters felt that the proposed bar for removal had been set too high; that the removal of these 19 identifiers created a difficult standard, since some identifiers may be buried in lengthy text fields.
Response: We understand that some of the identifiers on our list for removal may be buried in text fields, but we see no alternative that protects privacy. In addition, we believe that such unstructured text fields have little or no value in a de-identified information set and would be removed in any case. With time, we expect that such identifiers will be kept out of places where they are hard to locate and expunge.
Comments: Some commenters asserted that this requirement creates a disincentive for covered entities to de-identify data and would compromise the Secretary's desire to see de-identified data used for a multitude of purposes. Others stated that the 'no reason to believe' test creates an unreasonable burden on covered entities, and would actually chill the release of de-identified information, and set an impossible standard.
Response: We recognize that the proposed standards might have imposed a burden that could have prevented the widespread use of de-identified information. We believe that our modifications to the final rule discussed above will make the process less burdensome and remove some of the disincentive. However, we could not loosen the standards as far as many commenters wanted without seriously jeopardizing the privacy of the subjects of the information. As discussed above, we modify the "no reason to know" standard that was part of the safe harbor provision and replace it in the final rule with an "actual knowledge" standard. We believe that this change provides additional certainty to covered entities using the safe harbor and should eliminate any chilling effect.
Comments: Although most commenters wanted to see data elements taken off the list, there were a small number of commenters that wanted to see data items added to the list. They believed that it is also necessary to remove clinical trial record numbers, device model serial numbers, and all proper nouns from the records.
Response: In response to these requests, we have slightly revised the list of identifiers that must be removed under the safe harbor provision. Clinical trial record numbers are included in the general category of "any other unique identifying number, characteristic, or code." These record numbers cannot be included with de-identified information because, although the availability of clinical trial numbers may be limited, they are used for other purposes besides de-identification/re-identification, such as identifying clinical trial records, and may be disclosed under certain circumstances. Thus, they do not meet the criteria in the rule for use as a unique record identifier for de-identified records. Device model serial numbers are included in "any device identifier or serial number" and must be removed. We considered the request to remove all proper nouns to be very burdensome to implement for very little increase in privacy and likely to be arbitrary in operation, and so it is not included in the final rule.