Extending the Utility of Federal Data Bases. Summary of Findings


  • Most of the databases show the race/ethnic identification of each person in sufficient detail to permit subgroup analysis, but the full detail is missing in a few surveys. All statistical agencies are expected to convert to the new race/ethnic classifications within the next few years. Thus, this would be the appropriate period in which to attempt to get uniformity in the detailed race/ethnicity codes to be entered in the data records, if ASPE believes this would permit useful improvements in Federal statistics.
  • The National Vital Statistics data sets and the 100 percent data from the decennial censuses are not subject to sampling errors for descriptive analyses, and there are therefore no impediments to subgroup analysis. The long form data in Census 2000 and the ACS are based on such large samples that analyses could be carried out on even very small subgroups with the results subject to only trivial sampling errors.
  • None of the other surveys provide sufficient precision to permit sophisticated analysis of all subgroups. The larger data sets (CPS-March, NHIS, and NHES) contain adequate samples of Mexican-Americans, and analyses based on cross-classification are possible. However, only simple distributions could be carried out reliably for most of the other race/ethnic subgroups. CPS-Monthly, SIPP, NIS, and MEPS also provide satisfactory data for Mexican-Americans, but even simple distributions for most of the other subgroups would have poor reliability. In the other surveys, only limited analysis of some of the larger subgroups could be carried out with any confidence.
  • Multi-year averages, of course, would improve the precision. Five-year averages will provide samples large enough to satisfy analytic needs for most Hispanic subgroups for the larger data sets, i.e., CPS and NHIS. Three-year averages in the current NHANES would provide reasonably precise data for Mexican-Americans, and 5 or 6-year averages would permit analyses of detailed age-sex classes. However, in all surveys the Cuban sample and the samples of most of the API subgroups would still be too small for anything but simple analyses. The other data sets would also be improved by averaging over time, but the effective sample sizes of many subgroups would still be small.
  • It is probably not practical to obtain multi-year averages for the periodic (as distinct from annual) surveys. These comprise NSFG, SIPP, ECLS-B, and ECLS-K. We also include NHES in this category since, although it is annual, the main data content varies from year to year.
  • Annual averages of unemployment rates for each subgroup in CPS would have reasonable precision and could be obtained with relatively little effort. Annual averages for other labor force items would be only a little better than monthly statistics.
  • There are a few items that appear on more than one survey, and combining the results would improve precision. However, this is a fairly rare occurrence and can satisfy only limited data needs.
  • If the U.S. Census Bureau goes ahead with its plans for the ACS (currently scheduled to start in the year 2003), it could be a major resource for subgroup analysis. First, the ACS will be able to supply annual statistics on a variety of demographic, social, and economic characteristics for each subgroup. Second, it could become the vehicle for obtaining much needed information for these groups, either through the addition of questions to the ACS, or through a special effort that uses the ACS as a source of sample. Finally, it could become the sampling frame for the selection of supplemental samples for other surveys, substantially reducing the cost of sample supplementation. However, in such cases, a number of bureaucratic hurdles would have to be overcome. Whether this could be done to the satisfaction of both the U.S. Census Bureau and the sponsoring agencies is uncertain.
  • Sample supplementation for most surveys will be quite expensive if use of the ACS is not practical. Statisticians have developed devices for reducing the sampling and screening costs for small population groups, but a considerable amount of screening would still be required. Also, it is unlikely that the devices would be effective for all subgroups.
  • We would like to repeat the caveats mentioned earlier in this report:
    1. The sample sizes provided in Tables 3-3 to 3-5, which were used to estimate effective sample sizes and to ascertain whether surveys achieved reasonable standards of precision, refer to specific time periods (reported in Section 1.2). The samples in most Federal surveys are fairly stable, but changes are made from time to time. Although small changes in sample size on the order of 10 or 15 percent will have only a negligible effect on the conclusions drawn in this report, much larger revisions occasionally occur. Before going ahead with a study of a subgroup in a particular survey, the analyst should refer to the documentation for the survey to see whether the sample sizes in Tables 3-3 to 3-5 are still applicable. Any important changes in the sample should be taken into account.
    2. The sample sizes in this report refer to each survey's total sample for the race/ethnic subgroup. When the analysis is restricted to a subclass of the total (e.g., all males, all females, or persons in a specific age group), the sample size should be adjusted accordingly.
    3. In a few surveys, a subsample is used for some variables. For example, NHIS frequently collects selected information from only one person in each sample household. Similarly, NHANES uses random subsamples of the full sample for some items. An analyst should ascertain whether or not the full sample is used for the variables of interest, and determine whether the sample sizes in Tables 3-3 to 3-5 are appropriate.
    4. The design effects reported in Table 3-2, which are necessary for the estimation of effective sample sizes, are averages over a broad set of items, and reflect variables for which correlations among household members are not excessive. There are some items for which almost all household members have the same value, e.g., presence of health insurance, poverty status, urban-rural residence, region of residence. The design effects are much larger for such items. Section 3.5 discusses methods of dealing with such situations.
    5. Finally, it is important to recognize that considerable "noise" is to be found in the statistics. For example, small differences in reporting of race/ethnicity among some of the databases, minor variations in sample size from year to year even when there are no changes in sample design, and the use of average design effects, which do not reflect the variation among items, are all sources of "noise." As a result, the conclusions drawn in this report should be considered approximations, but they are sufficiently accurate to be a useful guide to the kinds of analyses of race/ethnic subgroups that are possible with the various databases.
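
Caveats 2 and 4, and the finding on multi-year averages, all come down to adjusting a sample size before judging precision. A minimal sketch of that arithmetic follows. It assumes the conventional relationship in which the effective sample size equals the nominal sample size divided by the design effect. The function names and every number are hypothetical illustrations, not values taken from Tables 3-2 to 3-5. The pooling helper further assumes independent annual samples, which may overstate the gain for surveys whose samples overlap from year to year.

```python
import math


def effective_sample_size(n, deff, subclass_share=1.0):
    """Scale the nominal sample to a subclass, then divide by the design effect.

    n              -- nominal sample size for the race/ethnic subgroup
    deff           -- design effect (hypothetical value; item-specific in practice)
    subclass_share -- fraction of the subgroup sample kept in the analysis
    """
    if n <= 0 or deff <= 0:
        raise ValueError("n and deff must be positive")
    if not 0 < subclass_share <= 1:
        raise ValueError("subclass_share must be in (0, 1]")
    return n * subclass_share / deff


def proportion_se(p, n_eff):
    """Approximate standard error of an estimated proportion p."""
    return math.sqrt(p * (1 - p) / n_eff)


def pooled_effective_size(n_eff_per_year, years):
    """Effective size of a multi-year average.

    Assumption: annual samples are independent and of equal size.
    """
    return n_eff_per_year * years


# Hypothetical subgroup: 2,000 sample persons, design effect 2.0,
# analysis restricted to a subclass making up half of the subgroup.
one_year = effective_sample_size(2000, 2.0, subclass_share=0.5)  # 500.0
five_year = pooled_effective_size(one_year, 5)                   # 2500.0

print(f"one-year effective n:  {one_year:.0f}")
print(f"five-year effective n: {five_year:.0f}")
print(f"SE of a 20% estimate, one year:   {proportion_se(0.2, one_year):.4f}")
print(f"SE of a 20% estimate, five years: {proportion_se(0.2, five_year):.4f}")
```

For an item on which household members nearly always agree, such as health insurance coverage or poverty status, a larger design effect than the survey average would be passed in, shrinking the effective sample size accordingly.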