Extending the Utility of Federal Data Bases. Designs for Sample Supplementation


Sample supplementation for small subgroups of the population is generally very expensive. This is due not just to the additional interview and data processing costs, but more so to the effort and cost involved in identifying a probability sample of each subgroup of interest. For example, Cuban-Americans constitute one-half of one percent of the U.S. population, so that with simple random selection about 200 households have to be screened to locate a single Cuban household. Most of the API subgroups are even smaller and will require even greater screening. This implies screening hundreds of thousands of households to locate samples of 1,000 or so supplemental cases. Such an effort could cost several million dollars for each survey, depending on the amount of supplementation and the desired level of reliability for the smallest subgroups, such as Hawaiians and Vietnamese.
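The screening arithmetic in the preceding paragraph can be reproduced with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes simple random selection, one eligible case per eligible household, and no nonresponse, none of which is specified in this report.

```python
import math
from fractions import Fraction


def households_to_screen(prevalence: Fraction, target_cases: int) -> int:
    """Expected number of households screened to locate target_cases.

    Assumptions (illustrative, not from the report): simple random
    selection, one eligible case per eligible household, no nonresponse.
    """
    if not 0 < prevalence <= 1:
        raise ValueError("prevalence must be in (0, 1]")
    # Exact rational arithmetic avoids floating-point rounding in the ratio.
    return math.ceil(Fraction(target_cases) / prevalence)


cuban_share = Fraction(1, 200)  # one-half of one percent of households
print(households_to_screen(cuban_share, 1))     # 200
print(households_to_screen(cuban_share, 1000))  # 200000
```

Under these assumptions a supplement of 1,000 Cuban households requires about 200,000 screenings, and any subgroup with a smaller population share requires proportionally more.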

Under some circumstances, it is possible to avoid, or reduce, the very great screening effort. The conditions that permit such reductions are described below.

The samples for two surveys are drawn from sampling frames that show race/ethnicity for each person on the frame. ECLS-B is selected from birth records. The vital statistics records contain the detailed race/ethnicity for almost all births. A few of the smaller API subgroups are only identified as "other API" in states that contain only a small percentage of these subgroups, but Chinese, Japanese, Hawaiian, and Filipinos are reported everywhere, as are all of the Hispanic subgroups. Thus, there would be relatively little additional cost to identify a supplemental sample for ECLS-B, although interviewing and data processing costs might still be substantial, depending on the size of the sample supplementation.

A little more effort would be required to supplement MCBS, but it could be done reasonably efficiently. The MCBS sampling frame consists of Medicare beneficiaries in HCFA files. Race and ethnicity are recorded on these files, but not in the detail required. There is a single code for Hispanics and one code for API. Sample supplementation would require selecting a sample of Hispanics and API, screening the sample (possibly by telephone when listed numbers are available), and subsampling persons within each subgroup. More work is involved than for the ECLS-B, but it can be carried out without excessive cost. American Indians or Alaska Natives are also identified on the MCBS frame, so supplementation of this population would be similar to that for ECLS-B.

The sampling frames for the other surveys are mostly area segments, although CPS and SIPP are based on census address lists and NHES and NIS use random digit dialing. In these surveys, the race/ethnicity of the sample households is not known in advance of the household contact, and a screening operation is necessary to identify the units eligible for the supplemental sample. Research on possible methods of reducing screening for samples of relatively rare population subgroups was carried out as part of the development of NHANES III procedures. No single procedure appeared to be universally applicable, but substantial gains in efficiency in sampling for Hispanics were possible by oversampling areas with heavy concentrations of Hispanics reported in the most recent census.1 Further research carried out jointly by Westat and NCHS statisticians confirmed these results and indicated the oversampling rates that would provide the lowest sampling errors.2 Unfortunately, the research indicated that only trivial improvements were possible through geographic oversampling for APIs or American Indians or Alaska Natives, since relatively high proportions of these populations reside in homes that are scattered throughout the general population. The research described above dealt with the broad race/ethnic groups (Hispanics, APIs, and American Indians or Alaska Natives) and did not explore the detailed subgroups. It is likely that geographic oversampling will be almost as effective for most Hispanic subgroups as for total Hispanics. It is possible that a few of the API subgroups are sufficiently clustered for this kind of sample to be effective, but a more detailed examination would be necessary to determine this. In any case, important gains are not possible for most of the API subgroups, or for American Indians or Alaska Natives. For the Hispanic subgroups, even with the gains in efficiency, a sizeable amount of screening would still be necessary.
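The trade-off behind geographic oversampling can be illustrated with a two-stratum calculation. The sketch below is not the procedure used in the NHANES III research. It assumes hypothetical stratum sizes and subgroup concentrations, and it uses Kish's approximation for effective sample size under unequal weights, (sum of weights) squared divided by the sum of squared weights, as a stand-in for the sampling-error calculations cited above.

```python
def screening_summary(strata, rates):
    """Summarize screening workload for a stratified screening sample.

    strata: list of (households, subgroup_prevalence) pairs (hypothetical).
    rates:  sampling fraction applied in each stratum.
    Returns expected screenings, expected eligible cases found, Kish
    effective sample size, and screenings per effective case.
    """
    screened = found = sum_w = sum_w2 = 0.0
    for (households, prevalence), rate in zip(strata, rates):
        n_screened = households * rate
        n_found = n_screened * prevalence
        weight = 1.0 / rate  # base weight of each case found in this stratum
        screened += n_screened
        found += n_found
        sum_w += n_found * weight
        sum_w2 += n_found * weight ** 2
    effective = sum_w ** 2 / sum_w2  # Kish approximation
    return {
        "screened": screened,
        "found": found,
        "effective": effective,
        "screened_per_effective": screened / effective,
    }


# Hypothetical frame: a high-concentration stratum and a low-concentration one.
clustered = [(1_000_000, 0.40), (9_000_000, 0.02)]
equal = screening_summary(clustered, [0.001, 0.001])
oversampled = screening_summary(clustered, [0.004, 0.001])
print(equal["screened_per_effective"], oversampled["screened_per_effective"])
```

With these assumed figures, oversampling the high-concentration stratum lowers the number of screenings needed per effective case. When a subgroup is spread almost evenly across strata, the same calculation shows little or no gain, which is consistent with the finding for APIs and for American Indians or Alaska Natives.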

Members of subpopulations identified through the NIS screener could be asked question modules addressing topics of interest to ASPE. This is the plan formulated by NCHS for the proposed State and Local Area Integrated Telephone Survey (SLAITS). The NIS annual screening sample is so large that sufficient sample sizes of each subpopulation can be identified yearly; screening costs would be minimal for such data collection efforts. The respondents, of course, would be limited to households with telephones.

The sample design and estimation method used in the Hispanic Health and Nutrition Examination Survey (HHANES) are a useful precedent to consider for sample supplementation. HHANES did not attempt to sample the entire target population, which consisted of Mexican-Americans, Cubans, and Puerto Ricans. The HHANES sample was restricted to geographic areas (counties and blocks) containing high concentrations of these subgroups. The sampling frame used for sample selection of PSUs in the Mexican-American sample was restricted to counties with moderate or large numbers of Mexican-Americans or where they constituted reasonably large percentages of the total population. Similarly, the within-PSU sample excluded census block groups or enumeration districts with small numbers of Mexican-Americans. Similar exclusions applied to the Cuban and Puerto Rican samples. The areas in the sampling frames contained well over 80 percent of each subgroup. A model was used to extrapolate the results of the surveys to the total region the data were intended to represent (Southwest for Mexican-Americans, Dade County for Cuban-Americans, and New York City and selected surrounding counties for Puerto Ricans). The model assumed similar health characteristics for persons inside and outside the areas of heavy concentration of minorities, within specific economic and demographic classes.3

The HHANES estimates appeared plausible, and users did not report any problems with the data. Of course, the modeling accounted for less than 20 percent of the total so that it was unlikely that even important problems with the model would introduce serious errors in the results. Use of models would be much more uncertain for API subgroups or American Indians or Alaska Natives. In 1990, 37 percent of APIs and 47 percent of American Indians or Alaska Natives lived in areas that were under 10 percent minority. Some years after a census, these percentages will be even greater. A procedure similar to HHANES that avoided excessive screening would probably be restricted to no more than 50 percent of APIs and about 40 percent of American Indians or Alaska Natives. The validity of data from models that account for the remaining 50 or 60 percent of the total is open to question.
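The reason frame coverage matters so much can be shown with a simple calculation. The sketch below is a deliberately simplified illustration, not the HHANES model: it ignores the economic and demographic classes used in that model, and the function name and the example values are hypothetical. If a frame covers a share c of the subgroup, and the estimate assumes that persons outside the frame resemble those inside it, the resulting bias is (1 - c) times the true inside-outside difference.

```python
def coverage_bias(coverage, mean_inside, mean_outside):
    """Bias from assuming uncovered persons resemble covered persons.

    coverage:     share of the subgroup living in the sampled areas.
    mean_inside:  true mean of a characteristic inside the sampled areas.
    mean_outside: true mean outside them (unknown in practice).
    """
    true_mean = coverage * mean_inside + (1.0 - coverage) * mean_outside
    model_estimate = mean_inside  # the model extends the inside mean to everyone
    return model_estimate - true_mean


# Hypothetical characteristic: 30 percent inside the frame, 25 percent outside.
print(coverage_bias(0.80, 30.0, 25.0))  # about 1.0 percentage point
print(coverage_bias(0.50, 30.0, 25.0))  # about 2.5 percentage points
```

For the same assumed inside-outside difference, the bias at 50 percent coverage is two and a half times the bias at 80 percent coverage, which is why a model that must account for half or more of a population warrants closer scrutiny.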

The sampling research for NHANES III mentioned earlier also explored the use of other kinds of sampling frames, in particular telephone listings of households with Spanish surnames (or with distinctive names for other minority groups) and lists of subscribers to foreign language newspapers or magazines. None had high enough coverage to be useful.