Assessment of Major Federal Data Sets for Analyses of Hispanic and Asian or Pacific Islander Subgroups and Native Americans:
Extending the Utility of Federal Data Bases

4. Combining Data for Several Years

Contents

  1. Annual vs. Surveys Carried out at Periodic Intervals
  2. Maximum Number of Years for Reasonable Analysis
  3. Effective Sample Size for Combined Years
  4. Surveys Meeting Standards for Precision
  5. CPS Labor Force Estimates

It is clear that, with the exception of the National Vital Statistics data sets, Census 2000, and the ACS, the surveys can provide only limited information on race/ethnic subpopulations. The Mexican-American samples are adequate in most of the surveys, but cross-classifications will rarely be possible for the other groups. Sections 4, 5, and 6 describe ways of enhancing the samples. In this section we discuss what is probably the simplest and least costly of these: combining several years of data. The discussion omits the NVS, Census 2000, and the ACS, since their existing sample sizes are fully adequate.

4.1 Annual vs. Surveys Carried out at Periodic Intervals

Combining years of data is only practical for surveys that are carried out one or more times per year. Some of the surveys are conducted at periodic intervals. Although it would be possible to combine several cycles of such surveys, the length of time covered — probably 10 years or more — would make the results of doubtful utility. Also, SIPP uses the same households over a number of years, so that combinations of years do not provide much additional information.

The annual surveys for which combinations of years are practical are the CPS (March and monthly), NHIS, NHANES, NIS, MEPS, MCBS, and NHSDA. NHES has been omitted since there is a different emphasis in subject matter each year, so that it falls closer to periodic than annual surveys.

The plans for the current NHANES implicitly assume that detailed analyses of the survey data will be based on averages over a number of years. Each year of the current NHANES is based on a representative sample of about 5,000 persons in total, far too few to provide acceptable data for the many age-sex-race/ethnicity domains NCHS considers important to study. Combinations of years will be used for analyses of these domains, probably up to 6 years for the most detailed groups. In some ways, this can be considered a model for annual averages for other surveys.

4.2 Maximum Number of Years for Reasonable Analysis

Section 2.6 of this report pointed out that the maximum number of years for which combined data would be meaningful depends on the specific item. Most health-related items and fertility patterns change rather slowly over time, and the most recent 3- to 5-year averages will generally reflect current conditions reasonably well. In fact, the NHIS has published 3-year average data for Asians and Pacific Islanders (as a combined group), so a precedent exists. Economic statistics, however, are likely to be much more volatile; thus the time period should be considerably shorter. (However, in the absence of any other data, even somewhat outdated information, such as a 3-year average, will be better than relying on the decennial census as the source of information for the full intercensal period. It is interesting to note that the ACS is planning to combine up to 5 years of data in order to produce reliable small-area data.)

To provide the greatest flexibility for users of this report, we will examine the improvement in precision for three combinations of years — 2, 3, and 5 years.

4.3 Effective Sample Size for Combined Years

The effective sample sizes for combined years are shown in Tables 4-1 to 4-3. Except for NHANES, the effective sample size for 2 years is a little less than twice the sample for a single year; similarly, the 3-year and 5-year effective samples are not quite 3 or 5 times the annual sample size. All of the surveys use clustered sample designs, and the samples for a sequence of several years are mostly in the same clusters or in neighboring ones. This lack of independence among the yearly samples reduces the effective sample size. We have estimated that the reduction in effective sample size is about 17 percent over a 2-year interval, 25 percent over a 3-year period, and about 35 percent over 5 years. These estimates come from the expected correlations between yearly samples: the correlations are expected to average .20 for adjacent years, .10 for years 2 apart, .07 for years 3 apart, and .05 for years 4 apart. The current NHANES samples are independent across years and, therefore, there is no reduction in effective sample size.
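
The reductions quoted above can be reproduced approximately from the stated correlations. The sketch below assumes equal-sized annual samples and the usual variance formula for the mean of correlated estimates, so that the design factor for an n-year average is 1 + (2/n) times the sum over lags k of (n - k) times the lag-k correlation. The function names are illustrative and are not taken from the report.

```python
# Expected correlations between yearly samples 1, 2, 3, and 4 years apart (from the text).
LAG_CORRELATIONS = [0.20, 0.10, 0.07, 0.05]


def design_factor(n_years, correlations):
    """Variance inflation of an n-year average relative to n independent samples.

    Assumes equal-sized annual samples (an assumption of this sketch).
    """
    lag_sum = sum(
        (n_years - lag) * correlations[lag - 1] for lag in range(1, n_years)
    )
    return 1.0 + 2.0 * lag_sum / n_years


def effective_years(n_years, correlations):
    """Number of independent annual samples that n combined years are worth."""
    return n_years / design_factor(n_years, correlations)


for n in (2, 3, 5):
    eff = effective_years(n, LAG_CORRELATIONS)
    reduction = 100 * (1 - eff / n)
    print(f"{n} years: worth {eff:.2f} independent years, reduction {reduction:.0f} percent")
```

Under these assumptions the reductions come out at roughly 17, 25, and 34 percent for 2, 3, and 5 years, close to the rounded figures given in the text. For NHANES, whose yearly samples are independent, all correlations would be zero and the design factor would be 1.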

The effective sample sizes in Tables 4-1 to 4-3 are approximations based on even more assumptions and averages than the numbers in Tables 3-6 to 3-8. The sample sizes in each year are subject to sampling error and to erratic response rates. This is especially true for the minority subgroups with very small samples; the samples for Cubans or Hawaiians could differ in neighboring years by 10 or 20 percent from the year reflected in our tables. Also, the year-to-year correlations, which result from the similarity of characteristics among neighboring households, are average values expected over a set of items, much like average design effects. Nevertheless, the numbers shown in Tables 4-1 to 4-3 indicate the order of magnitude of effective sample sizes and reveal whether useful analyses are possible from each of the data sets.

One feature of the monthly CPS sample should be noted. The monthly CPS includes two kinds of data sets: (1) labor force information and critical demographic items (e.g., age, sex, household relationship) obtained each month; and (2) supplemental items covered in months other than March. The supplemental items (based on the monthly CPS sample sizes) that are likely to be of greatest interest are number of children ever born, related fertility information, and school enrollment. Voting registration and behavior in the most recent election are obtained every second year, but it is doubtful that combining pairs of years would be meaningful: voting patterns in presidential and non-presidential years are very different, and such combinations would probably not be analytically revealing. The entries for CPS in Tables 4-1 to 4-3 are restricted to the supplemental items. The annual sample sizes for labor force information are, of course, much larger than the numbers shown, since they are composed of 12 monthly samples (see Section 4.5). The supplemental items included in the March interview are based on the same sample as the other monthly supplements, except for Hispanics, for whom the sample is doubled.

Since the MEPS and NSFG samples are taken from NHIS respondents, it would be possible to supplement their samples with additional names and addresses from the NHIS. These names and addresses would be a few years old, and thus it may be more convenient to simply combine multiple years of MEPS respondents. The exact timing of these surveys, and the associated costs, would have to be examined before a decision is made on which approach would be preferable.

4.4 Surveys Meeting Standards for Precision

Section 3.6 discussed the ability of the surveys to produce reasonable precision in the analysis of the subpopulations or of cross-classifications within these subpopulations. ("Reasonable precision" is based on a subjective judgment of the importance of meeting the various standards described earlier, i.e., CVs of 30 percent, 20 percent, and 10 percent for prevalence rates of .01, .05, .10, .15, and .20.) We will use the same criteria to evaluate the analytic ability of combinations of several years of survey data. As in the case of data for a single year, some studies may need greater precision and others less, and analysts should consider whether they need to modify the summary below.
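
To make the precision standards concrete, the following sketch computes the coefficient of variation (CV) of an estimated prevalence rate and the effective sample size needed to reach a target CV. It assumes the simple binomial variance p(1 - p)/n applied to the effective sample size; that is an assumption made here for illustration, not necessarily the exact procedure of Section 3.6, and the function names are hypothetical.

```python
import math

PREVALENCE_RATES = (0.01, 0.05, 0.10, 0.15, 0.20)
TARGET_CVS = (0.30, 0.20, 0.10)


def cv_of_rate(prevalence, effective_n):
    """CV of an estimated proportion, assuming binomial variance on the effective sample."""
    return math.sqrt((1.0 - prevalence) / (prevalence * effective_n))


def required_effective_n(prevalence, target_cv):
    """Effective sample size at which the CV of the rate equals target_cv."""
    return (1.0 - prevalence) / (prevalence * target_cv ** 2)


for p in PREVALENCE_RATES:
    needs = [round(required_effective_n(p, cv)) for cv in TARGET_CVS]
    print(f"rate {p:.2f}: effective n needed for CVs of 30/20/10 percent = {needs}")
```

Under this assumption, a prevalence rate of .10 needs an effective sample of about 900 for a CV of 10 percent but only about 100 for a CV of 30 percent, which is why rare characteristics and tight standards quickly exhaust the smaller subgroup samples.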

Three- or 5-year averages for CPS supplemental items collected in a single month of a given year would provide sample sizes large enough to satisfy analytic needs for most Hispanic subgroups, although only limited cross-classifications would be possible for Cuban-Americans. Fairly detailed analyses would be possible for American Indians or Alaska Natives, and for Chinese and Filipinos. Less detailed cross-classifications would be available for most of the other API subgroups, and only simple distributions for Hawaiians would have reasonable reliability.

The NHIS Hispanic sample is quite large, and a 2- or 3-year combination will provide quite reliable data, including cross-classifications, for all Hispanic subgroups, and moderately detailed cross-classifications for Cuban-Americans. A 5-year average will permit quite detailed analysis. A 5-year average of the American Indian or Alaska Native data set will satisfy almost all the requirements. A 3-year average could be used for Chinese and Filipinos, but 5 years are probably necessary for the other API subgroups.

NHANES has a very large sample of Mexican-Americans, and averaging over time will permit fairly detailed cross-classification analyses. The sample was deliberately set up with multi-year averages in mind. None of the other minority subgroups would be helped enough for even simple prevalence rates to have adequate precision.

Table 4-1.
Effective sample sizes for Hispanic subgroups, using combined years of data

Data set        Total   Mexican-   Puerto   Cuban       Central or      Other
                        American    Rican           South American   Hispanic
CPS–March
   2 years     12,537      7,727    1,324     523            1,875      1,086
   3 years     16,891     10,411    1,784     704            2,527      1,463
   5 years     24,773     15,269    2,617   1,033            3,706      2,145
CPS–Monthly
   2 years      6,274      3,863      663     262              940        546
   3 years      8,453      5,204      893     353            1,267        736
   5 years     12,398      7,633    1,310     518            1,858      1,079
NHIS
   2 years     24,654     15,441    2,620   1,298            3,495      1,802
   3 years     33,217     20,804    3,530   1,748            4,709      2,428
   5 years     48,718     30,512    5,178   2,564            6,907      3,561
NIS
   2 years      6,232      4,534      511     127              676        386
   3 years      8,397      6,109      689     171              911        520
   5 years     12,316      8,960    1,010     251            1,337        762
NHANES
   2 years      3,164      3,000       48      20               64         32
   3 years      4,746      4,500       72      30               96         48
   5 years      7,910      7,500      120      50              160         80
MEPS
   2 years      7,480      5,080      835     314            1,064        187
   3 years     10,078      6,844    1,125     423            1,433        252
   5 years     14,781     10,039    1,650     620            2,103        370
MCBS
   2 years        705        386       63     102               78         75
   3 years        950        520       86     137              106        101
   5 years      1,393        762      125     201              155        149
NHSDA
   2 years      3,796      2,406      401     160              548        282
   3 years      5,114      3,242      540     216              738        380
   5 years      7,501      4,755      792     317            1,082        558

NOTE: The sample cases for each data set reflect the population coverage of the respective surveys. For example, CPS-March covers all persons in the civilian noninstitutional population, whereas NSFG covers women 15 to 44 years of age. The descriptions of the respective data sets note the appropriate population coverage.

Table 4-2.
Effective sample sizes for API subgroups, using combined years of data

Data set     Total API  Chinese  Filipino  Japanese  Asian Indian  Korean  Vietnamese  Hawaiian  Other
CPS–March
   2 years       5,072    1,107       947       573           551     539         418       139    630
   3 years       6,833    1,492     1,276       772           743     727         563       187    848
   5 years      10,022    2,188     1,871     1,132         1,089   1,066         825       274  1,244
CPS–Monthly
   2 years       5,072    1,107       947       573           551     539         418       139    630
   3 years       6,833    1,492     1,276       772           743     727         563       187    848
   5 years      10,022    2,188     1,871     1,132         1,089   1,066         825       274  1,244
NHIS
   2 years       4,063      934       800       441           396     423         441       139    489
   3 years       5,474    1,258     1,078       594           533     569         594       187    659
   5 years       8,029    1,848     1,581       871           782     835         871       274    967
NIS
   2 years       1,506      341       292       175           169     165         129        43    192
   3 years       2,030      459       394       236           227     222         173        59    259
   5 years       2,977      673       578       347           333     327         254        86    380
NHANES
   2 years         226       54        43        26            25      25          19         6     28
   3 years         340       81        65        39            38      38          29        10     42
   5 years         566      134       108        64            63      63          48        16     69
MEPS
   2 years         596      120       135        50            89      77          35        13     79
   3 years         804      162       182        67           107      92          43        18    107
   5 years       1,178      237       267        99           176     152          69        26    156
MCBS
   2 years         229       52        43        27            25      25          20         7     28
   3 years         308       70        59        36            34      34          27         9     38
   5 years         452      102        86        53            50      50          40        13     56
NHSDA
   2 years         531      120       102        62            58      58          45        15     68
   3 years         716      162       137        83            79      79          61        20     92
   5 years       1,049      238       201       122           116     116          89        30    135

NOTE: The sample cases for each data set reflect the population coverage of the respective surveys. For example, CPS-March covers all persons in the civilian noninstitutional population, whereas NSFG covers women 15 to 44 years of age. The descriptions of the respective data sets note the appropriate population coverage.

Table 4-3.
Effective sample sizes for American Indians or Alaska Natives, using combined years of data
Data set American Indian or Alaska Native
CPS–March
   2 years 1,782
   3 years 2,401
   5 years 3,521
CPS–Monthly
   2 years 1,782
   3 years 2,401
   5 years 3,521
NHIS
   2 years 1,089
   3 years 1,467
   5 years 2,152
NIS
   2 years 591
   3 years 797
   5 years 1,168
NHANES
   2 years 47
   3 years 71
   5 years 118
MEPS
   2 years 299
   3 years 403
   5 years 591
MCBS
   2 years 38
   3 years 52
   5 years 76
NHSDA
   2 years 125
   3 years 169
   5 years 248
NOTE:
The sample cases for each data set reflect the population coverage of the respective surveys. For example, CPS covers persons in the civilian noninstitutional population, whereas NSFG covers women 15 to 44 years of age. The descriptions of the respective data sets note the appropriate population coverage.

The Mexican-American samples in the NIS, MEPS, and NHSDA are fairly large, and even 2-year combinations will permit fairly detailed cross-classifications. Five-year combinations are necessary for most of the other Hispanic subgroups. Five years will permit simple analyses of NIS data for most of the API subgroups and for American Indians or Alaska Natives. However, even 5 years is not sufficient for the API subgroups and American Indians or Alaska Natives in MEPS and NHSDA. The MCBS sample of minorities is so small that even 5 years fails to satisfy most of the precision requirements, except for Mexican-Americans, for whom simple distributions are possible but not detailed cross-classifications.

4.5 CPS Labor Force Estimates

The sample sizes shown for CPS in Tables 4-1 through 4-3, both March and monthly, apply to data obtained in a single month of the year. They cover the March supplements (income, mobility, work experience, and several other items) and the supplemental information collected in other months, particularly school enrollment and fertility, as well as voting and registration, which is included every other year. In addition, CPS collects labor force status every month, with the sample size shown for CPS–Monthly.

Estimates of annual averages of such items as employment, unemployment, occupation, industry, and related labor force items can be produced by combining data for the 12 months of each year. There is a precedent for such annual averages; for many years CPS has produced annual unemployment rates for the larger states.

The number of observations for annual averages is 12 times the number for CPS–Monthly shown in Tables 3-3 to 3-5, but the effective sample size is lower. The CPS rotation pattern retains households in the sample for a sequence of 4 months, drops them for the next 8 months, and then reinstates them for another 4-month period. As a result, over the course of a year there are multiple observations on most of the sample persons. Furthermore, in the months when a group of sample persons is dropped, most of the sample replacements are neighboring households whose characteristics are usually correlated with those of the households they replace.
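
The overlap created by this rotation pattern can be illustrated with a short sketch. It assumes that a new rotation group of equal size enters the sample every month, so that each monthly sample is made up of eight groups; that assumption and the names used are illustrative and are not stated in this section. The replacement of dropped households by correlated neighbors is not modeled.

```python
# Months, counted from a household's first interview, in which it is in sample:
# 4 months in, 8 months out, 4 months in again.
IN_SAMPLE_OFFSETS = {0, 1, 2, 3, 12, 13, 14, 15}


def overlap_fraction(months_apart):
    """Share of one month's sample households also interviewed months_apart later.

    Assumes one equal-sized rotation group enters each month (assumption of this sketch).
    """
    shared = sum(
        1 for offset in IN_SAMPLE_OFFSETS
        if offset + months_apart in IN_SAMPLE_OFFSETS
    )
    return shared / len(IN_SAMPLE_OFFSETS)


for gap in range(1, 13):
    print(f"{gap:2d} months apart: {overlap_fraction(gap):.3f} of households in common")
```

Under these assumptions, three-quarters of the households are common to adjacent months, none are common to months 4 to 8 months apart, and half are common to the same month one year later.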

The correlations vary greatly among the labor force items. They are very high for items that tend to persist for most persons over the course of a year, e.g., whether or not a person is in the labor force or employed, and occupation. They are more moderate for unemployment. The U.S. Census Bureau has estimated both the correlations and the effective sample sizes for CPS annual averages (see footnote 1). The results indicate that the effective sample size for annual estimates of the unemployment rate is five times the monthly sample. For most of the other labor force items, the effective sample size is only twice the monthly sample. Estimates of average annual unemployment rates will thus be based on effective sample sizes five times as large as the numbers in Tables 3-6 to 3-8. Estimates of unemployment rates will satisfy reasonable precision requirements for almost all the minority subgroups. The cost of obtaining annual averages will be quite low since public use files are available.
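
The multipliers cited above translate into effective annual sample sizes as in the sketch below. The monthly effective sample size used in the example is a hypothetical placeholder, not a value from Tables 3-6 to 3-8, and the dictionary keys and function names are illustrative.

```python
MONTHS_PER_YEAR = 12

# Effective annual sample as a multiple of the monthly effective sample (from the text).
ANNUAL_MULTIPLIER = {
    "unemployment_rate": 5,
    "other_labor_force_items": 2,
}


def annual_effective_n(monthly_effective_n, item):
    """Effective sample size of an annual average for the given kind of item."""
    return monthly_effective_n * ANNUAL_MULTIPLIER[item]


def implied_design_factor(item):
    """How far 12 overlapping monthly samples fall short of 12 independent ones."""
    return MONTHS_PER_YEAR / ANNUAL_MULTIPLIER[item]


monthly_n = 1000  # hypothetical monthly effective sample size for one subgroup
for item in ANNUAL_MULTIPLIER:
    print(item, annual_effective_n(monthly_n, item), implied_design_factor(item))
```

Read this way, the 12 monthly observations are worth 5 independent monthly samples for the unemployment rate and only 2 for the more persistent items.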


1  “Current Population Survey Variance Properties” by Gunlicks, Corteville, and Mansur, Proceedings of the Survey Research Methods Section of the 1997 American Statistical Association annual meetings.



Last updated 9/14/00