It is clear that, with the exception of the National Vital Statistics data sets, the Census 2000, and the ACS, the surveys can provide only limited information on race/ethnic subpopulations. The Mexican-American samples are adequate in most of the surveys, but cross-classifications will rarely be possible for the other groups. Sections 4, 5, and 6 describe ways of enhancing the samples. In this section we discuss what is probably the simplest and least costly way of doing this, that is, combining several years of data. The discussion, of course, omits the NVS, Census 2000, and the ACS, since their existing sample sizes are fully adequate.

Annual Surveys vs. Surveys Carried Out at Periodic Intervals

Combining years of data is only practical for surveys that are carried out one or more times per year. Some of the surveys are conducted at periodic intervals. Although it would be possible to combine several cycles of such surveys, the length of time covered (probably 10 years or more) would make the results of doubtful utility. Also, SIPP uses the same households over a number of years, so that combinations of years do not provide much additional information.
The annual surveys for which combinations of years are practical are the CPS (March and monthly), NHIS, NHANES, NIS, MEPS, MCBS, and NHSDA. NHES has been omitted since there is a different emphasis in subject matter each year, so that it falls closer to periodic than annual surveys.
The plans for current NHANES implicitly assume that the detailed analyses of the survey data will be based on averages over a number of years. Each year of current NHANES is based on a representative sample of about 5,000 persons in total, far too few to provide acceptable data for the many age-sex-race/ethnicity domains NCHS considers important to study. Combinations of years will be used for analyses of these domains, probably up to 6 years for the most detailed groups. In some ways, this can be considered a model for annual averages for other surveys.


Maximum Number of Years for Reasonable Analysis

Section 2.6 of this report pointed out that the maximum number of years for which combined data would be meaningful depends on the specific item. Most health-related items and fertility patterns change rather slowly over time, and the most recent 3- to 5-year averages will generally reflect current conditions reasonably well. In fact, the NHIS has published 3-year average data for Asian and Pacific Islanders (as a combined group), so a precedent exists. Economic statistics, however, are likely to be much more volatile; thus the time period should be considerably shorter. (However, in the absence of any other data, even somewhat outdated information, such as a 3-year average, will be better than relying on the decennial census as the source of information for the full intercensal period. It is interesting to note that the ACS is planning to combine up to 5 years of data in order to produce reliable small-area data.)
To provide the greatest flexibility for users of this report, we will examine the improvement in precision for three combinations of years: 2, 3, and 5 years.


Effective Sample Size for Combined Years

The effective sample sizes for combined years are shown in Tables 4-1 to 4-3. Except for NHANES, the effective sample sizes for 2 years are a little less than twice the sample for a single year; similarly, the 3- and 5-year effective samples are not quite 3 or 5 times the annual sample sizes. All of the surveys use clustered sample designs, and the samples for a sequence of several years are mostly in the same clusters or in neighboring ones. This lack of independence among the samples for different years tends to reduce the effective sample size. We have estimated that the reduction in effective sample size over a 2-year interval is about 17 percent; the reduction for a 3-year period is 25 percent; and the reduction for 5 years is about 35 percent. These figures come from estimated correlations in the sample: year-to-year correlations are expected to average .20, correlations 2 years apart .10, 3 years apart .07, and 4 years apart .05. The current NHANES samples are independent across years and, therefore, there is no reduction in effective sample size.
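One standard way to arrive at reductions of this kind is to treat a k-year average as the mean of k equally sized annual estimates whose correlation depends only on how many years apart they are. The sketch below makes that assumption; the function names are illustrative, and the report does not state that its figures were computed exactly this way.

```python
def variance_inflation(k, lag_corr):
    """Variance of a k-year average relative to k independent annual samples.

    lag_corr[d] is the assumed correlation between samples d years apart.
    Assumes equal sample sizes and equal variances in every year.
    """
    return 1.0 + (2.0 / k) * sum((k - d) * lag_corr[d] for d in range(1, k))


def effective_multiplier(k, lag_corr):
    """Effective sample size of k combined years, in units of one annual sample."""
    return k / variance_inflation(k, lag_corr)


# Average correlations quoted in the text, keyed by the number of years apart.
LAG_CORR = {1: 0.20, 2: 0.10, 3: 0.07, 4: 0.05}

for k in (2, 3, 5):
    mult = effective_multiplier(k, LAG_CORR)
    reduction = 1.0 - mult / k
    print(f"{k} years: multiplier {mult:.2f}, reduction {reduction:.1%}")
```

Under these assumptions the reductions come out at roughly 17, 25, and 34 percent, close to the figures quoted above. Setting every correlation to zero, as for the independent NHANES samples, gives a multiplier equal to the number of years.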
The effective sample sizes in Tables 4-1 to 4-3 are approximations based on even more assumptions and averages than the numbers in Tables 3-6 to 3-8. The sample sizes in each year are subject to sampling errors and to the vagaries of erratic response rates. This is especially true for the minority subgroups with very small samples; the samples for Cubans or Hawaiians could differ in neighboring years by 10 or 20 percent from the year reflected in our tables. Also, the year-to-year correlations, which result from the similarities in characteristics of neighboring households, are average values expected over a set of items, similar to the use of average design effects. Nevertheless, the numbers shown in Tables 4-1 to 4-3 indicate the order of magnitude of effective sample sizes and reveal whether useful analyses are possible from each of the data sets.
One feature of the monthly CPS sample should be noted. The monthly CPS includes two kinds of data sets: (1) labor force information and critical demographic items (e.g., age, sex, household relationship, etc.) obtained each month; and (2) supplemental items covered in months other than March. The supplemental items (based on the monthly CPS sample sizes) that are likely to be of greatest interest are number of children ever born, related fertility information, and school enrollment. Voting registration and behavior in the most recent election are obtained every second year, but it is doubtful that combining pairs of years would be meaningful. Voting patterns in presidential and nonpresidential years are very different, and such combinations would probably not be analytically revealing. The entries for CPS in Tables 4-1 to 4-3 are restricted to the supplemental items. The annual sample sizes for labor force information, of course, are much larger than the numbers shown, since they consist of 12 monthly samples (see Section 4.5). The supplemental items included in the March interview are based on the same sample as the other monthly supplements, except for Hispanics, for whom the sample is doubled.
Since the MEPS and NSFG samples are taken from NHIS respondents, it would be possible to supplement their samples with additional names and addresses from the NHIS. These names and addresses would be a few years old, and thus it may be more convenient to simply combine multiple years of MEPS respondents. The exact timing of these surveys, and the associated costs, would have to be examined before a decision is made on which approach would be preferable.


Surveys Meeting Standards for Precision

Section 3.6 discussed the ability of the surveys to produce reasonable precision in the analysis of the subpopulations or for cross-classifications within these subpopulations. ("Reasonable precision" is based on a subjective judgment of the importance of meeting the various standards described earlier, i.e., CVs of 30 percent, 20 percent, and 10 percent for prevalence rates of .01, .05, .10, .15, and .20.) We will use the same criteria to evaluate the analytic ability of combinations of several years of survey data. As in the case of data for a single year, some studies may need greater precision and others less, and analysts should consider whether they need to modify the summary below.
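To give a feel for what these standards imply, the sketch below converts a target CV and a prevalence rate into a required effective sample size. It assumes the simple binomial approximation CV = sqrt((1 - p) / (n p)), with the design effect taken to be already absorbed into the effective sample size n; the standards described earlier in the report may have been derived differently, and the function names are illustrative.

```python
import math


def cv_of_rate(p, n_eff):
    """Approximate CV of an estimated prevalence rate p from n_eff effective cases."""
    return math.sqrt((1.0 - p) / (p * n_eff))


def required_n_eff(p, target_cv):
    """Smallest effective sample size whose approximate CV does not exceed target_cv."""
    raw = (1.0 - p) / (p * target_cv ** 2)
    # Round before taking the ceiling so floating-point noise cannot add a spurious case.
    return math.ceil(round(raw, 6))


# Prevalence rates and CV standards named in the text.
for p in (0.01, 0.05, 0.10, 0.15, 0.20):
    needs = {cv: required_n_eff(p, cv) for cv in (0.30, 0.20, 0.10)}
    print(f"prevalence {p:.2f}: required effective sample sizes {needs}")
```

For example, under this approximation a prevalence rate of .10 estimated with a CV of 10 percent needs about 900 effective cases. The effective sample sizes for combined years can be compared against thresholds of this kind.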
Three- or 5-year averages for CPS supplemental items collected in a single month for a given year would provide sample sizes large enough to satisfy analytic needs for most Hispanic subgroups, although only limited cross-classifications would be possible for Cuban-Americans. Fairly detailed analyses would be possible for American Indians or Alaska Natives, and for Chinese and Filipinos. Less detailed cross-classifications would be available for most of the other API subgroups, and only simple distributions for Hawaiians would have reasonable reliability.
The NHIS Hispanic sample is quite large, and a 2- or 3-year combination will provide quite reliable data, including cross-classifications, for all Hispanic subgroups, and moderately detailed cross-classifications for Cuban-Americans. A 5-year average will permit quite detailed analysis. A 5-year average of the American Indian or Alaska Native data set will satisfy almost all the requirements. A 3-year average could be used for Chinese and Filipinos, but 5 years are probably necessary for the other API subgroups.
NHANES has a very large sample of Mexican-Americans, and averaging over time will permit fairly detailed cross-classification analyses. The sample was deliberately set up with multiyear averages in mind. None of the other minority subgroups would be helped enough for even simple prevalence rates to have adequate precision.
Table 4-1.
Effective sample sizes for Hispanic subgroups, using combined years of data

Data set        Total   Mexican-     Puerto      Cuban Central or      Other
                        American      Rican                 South   Hispanic
                                                         American
CPS-March
  2 years      12,537      7,727      1,324        523      1,875      1,086
  3 years      16,891     10,411      1,784        704      2,527      1,463
  5 years      24,773     15,269      2,617      1,033      3,706      2,145
CPS-Monthly
  2 years       6,274      3,863        663        262        940        546
  3 years       8,453      5,204        893        353      1,267        736
  5 years      12,398      7,633      1,310        518      1,858      1,079
NHIS
  2 years      24,654     15,441      2,620      1,298      3,495      1,802
  3 years      33,217     20,804      3,530      1,748      4,709      2,428
  5 years      48,718     30,512      5,178      2,564      6,907      3,561
NIS
  2 years       6,232      4,534        511        127        676        386
  3 years       8,397      6,109        689        171        911        520
  5 years      12,316      8,960      1,010        251      1,337        762
NHANES
  2 years       3,164      3,000         48         20         64         32
  3 years       4,746      4,500         72         30         96         48
  5 years       7,910      7,500        120         50        160         80
MEPS
  2 years       7,480      5,080        835        314      1,064        187
  3 years      10,078      6,844      1,125        423      1,433        252
  5 years      14,781     10,039      1,650        620      2,103        370
MCBS
  2 years         705        386         63        102         78         75
  3 years         950        520         86        137        106        101
  5 years       1,393        762        125        201        155        149
NHSDA
  2 years       3,796      2,406        401        160        548        282
  3 years       5,114      3,242        540        216        738        380
  5 years       7,501      4,755        792        317      1,082        558

NOTE: The sample cases for each data set reflect the population coverage of the respective surveys. For example, CPS-March covers all persons in the civilian noninstitutional population, whereas NSFG covers women 15 to 44 years of age. The descriptions of the respective data sets note the appropriate population coverage.

Table 4-2.
Effective sample sizes for API subgroups, using combined years of data

Data set    Total API    Chinese   Filipino   Japanese      Asian     Korean Vietnamese   Hawaiian      Other
                                                           Indian
CPS-March
  2 years       5,072      1,107        947        573        551        539        418        139        630
  3 years       6,833      1,492      1,276        772        743        727        563        187        848
  5 years      10,022      2,188      1,871      1,132      1,089      1,066        825        274      1,244
CPS-Monthly
  2 years       5,072      1,107        947        573        551        539        418        139        630
  3 years       6,833      1,492      1,276        772        743        727        563        187        848
  5 years      10,022      2,188      1,871      1,132      1,089      1,066        825        274      1,244
NHIS
  2 years       4,063        934        800        441        396        423        441        139        489
  3 years       5,474      1,258      1,078        594        533        569        594        187        659
  5 years       8,029      1,848      1,581        871        782        835        871        274        967
NIS
  2 years       1,506        341        292        175        169        165        129         43        192
  3 years       2,030        459        394        236        227        222        173         59        259
  5 years       2,977        673        578        347        333        327        254         86        380
NHANES
  2 years         226         54         43         26         25         25         19          6         28
  3 years         340         81         65         39         38         38         29         10         42
  5 years         566        134        108         64         63         63         48         16         69
MEPS
  2 years         596        120        135         50         89         77         35         13         79
  3 years         804        162        182         67        107         92         43         18        107
  5 years       1,178        237        267         99        176        152         69         26        156
MCBS
  2 years         229         52         43         27         25         25         20          7         28
  3 years         308         70         59         36         34         34         27          9         38
  5 years         452        102         86         53         50         50         40         13         56
NHSDA
  2 years         531        120        102         62         58         58         45         15         68
  3 years         716        162        137         83         79         79         61         20         92
  5 years       1,049        238        201        122        116        116         89         30        135

NOTE: The sample cases for each data set reflect the population coverage of the respective surveys. For example, CPS-March covers all persons in the civilian noninstitutional population, whereas NSFG covers women 15 to 44 years of age. The descriptions of the respective data sets note the appropriate population coverage.

Table 4-3.
Effective sample sizes for American Indians or Alaska Natives, using combined years of data

Data set    American Indian or
                 Alaska Native
CPS-March
  2 years                1,782
  3 years                2,401
  5 years                3,521
CPS-Monthly
  2 years                1,782
  3 years                2,401
  5 years                3,521
NHIS
  2 years                1,089
  3 years                1,467
  5 years                2,152
NIS
  2 years                  591
  3 years                  797
  5 years                1,168
NHANES
  2 years                   47
  3 years                   71
  5 years                  118
MEPS
  2 years                  299
  3 years                  403
  5 years                  591
MCBS
  2 years                   38
  3 years                   52
  5 years                   76
NHSDA
  2 years                  125
  3 years                  169
  5 years                  248

NOTE: The sample cases for each data set reflect the population coverage of the respective surveys. For example, CPS covers persons in the civilian noninstitutional population, whereas NSFG covers women 15 to 44 years of age. The descriptions of the respective data sets note the appropriate population coverage.

The Mexican-American samples in the NIS, MEPS, and NHSDA are fairly large, and even 2-year combinations will permit fairly detailed cross-classifications. Five-year combinations are necessary for most of the other Hispanic subgroups. Five years will permit simple analyses of NIS for most of the API subgroups and for American Indians or Alaska Natives. However, even 5 years is not sufficient for the API subgroups and American Indians or Alaska Natives in MEPS and NHSDA. The MCBS sample of minorities is so small that 5 years fails to satisfy most of the precision requirements, except for Mexican-Americans, for whom simple distributions are possible, but not detailed cross-classifications.


CPS Labor Force Estimates

The sample sizes shown for CPS in Tables 4-1 through 4-3, both March and monthly, apply to data obtained in a single month of the year. They include the March supplements (income, mobility, work experience, and several other items) and the supplemental information covered in other months, particularly school enrollment and fertility, and voting and registration, which is included every other year. However, CPS collects labor force status each month with the sample size shown for CPS Monthly.
Estimates of annual averages of such items as employment, unemployment, occupation, industry, and related labor force items can be produced by combining data for the 12 months of each year. There is a precedent for such annual averages; for many years CPS has produced annual unemployment rates for the larger states.
The number of observations for annual averages is 12 times the numbers for CPS monthly shown in Tables 3-3 to 3-5, but the effective sample size is lower. The CPS rotation pattern retains households in the sample for a sequence of 4 months, drops them for the next 8 months, and then reinstates them for another 4-month period. As a result, over the course of a year there are multiple observations on most of the sample persons. Furthermore, in the months when a group of sample persons is dropped, most of the sample replacements are neighboring households whose characteristics are usually correlated with the households they replace.
The correlations vary greatly among the labor force items. They are very high for items that tend to persist for most persons over the course of a year, e.g., whether or not a person is in the labor force or employed, and occupation. They are more moderate for unemployment. The U.S. Census Bureau has estimated both the correlations and the effective sample sizes for CPS annual averages.^{1} The results indicate that the effective sample size for annual estimates of the unemployment rate is five times the monthly sample. For most of the other labor force items, the effective sample size is only twice the monthly sample. Estimates of average annual unemployment rates, thus, will be based on effective sample sizes five times as large as the numbers in Tables 3-6 to 3-8. Estimates of unemployment rates will satisfy reasonable precision requirements for almost all the minority subgroups. The cost of obtaining annual averages will be quite low since public use files are available.
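The arithmetic behind these multipliers can be sketched as follows. The two multipliers are the ones stated above; the monthly effective sample size of 500 is a hypothetical value chosen only for illustration, and the names are not drawn from any Census Bureau software.

```python
# Effective annual sample size relative to a single month, as stated in the text.
ANNUAL_MULTIPLIER = {
    "unemployment_rate": 5.0,
    "other_labor_force_items": 2.0,
}
MONTHS_PER_YEAR = 12


def annual_effective_size(monthly_n_eff, item):
    """Effective sample size of a 12-month average for the given kind of item."""
    return ANNUAL_MULTIPLIER[item] * monthly_n_eff


def implied_inflation(item):
    """Factor by which the 12 monthly observations are discounted for correlation."""
    return MONTHS_PER_YEAR / ANNUAL_MULTIPLIER[item]


monthly_n_eff = 500  # hypothetical monthly effective sample size for one subgroup
for item in ANNUAL_MULTIPLIER:
    print(item, annual_effective_size(monthly_n_eff, item), implied_inflation(item))
```

Read this way, the 12 monthly samples are worth about 5 independent months for the unemployment rate but only about 2 for the more persistent items, which is why annual averaging helps unemployment estimates the most.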
^{1} Current Population Survey Variance Properties by Gunlicks, Corteville, and Mansur, Proceedings of the Survey Research Methods Section of the 1997 American Statistical Association annual meetings.
