Extending the Utility of Federal Data Bases. Nominal and Effective Sample Sizes

05/01/2000

Tables 3-3, 3-4, and 3-5 show the sample sizes for all of the race/ethnic subgroups; these are the same numbers reported in Tables A-1 to A-3 of the Task 2 report. As noted, these data are approximations of the number of sample cases for each subpopulation. They were taken from published reports of the Federal agencies sponsoring the surveys, provided by the agencies, or derived by Westat. We refer to these numbers as the "nominal sample sizes" to distinguish them from the effective sample sizes. We have included all race/ethnic subgroups, including those that are not currently identified in the data set. The sources used to provide estimates of design effects are shown in Appendix A.

 

Table 3-2.
Average design effects(1) for minorities
Survey | Average design effect
Census
    Census 2000 1.0
    ACS 1.0
    CPS(2) 1.5
    SIPP - Hispanic 2.4
    SIPP - API and American Indians or Alaska Natives 1.6
NCHS/CDC
    NHIS - Hispanics and American Indians or Alaska Natives 1.5
    NHIS - API 1.3
    NSFG - Hispanics and American Indians or Alaska Natives 1.7
    NSFG - API 1.4
    NIS 1.3
    NHANES(3) - Mexican-American 2.2
    NHANES(3) - Other minorities 1.8
AHRQ
    MEPS - Hispanic 1.2
    MEPS - API and American Indians or Alaska Natives 2.1
HCFA
    MCBS 1.1
SAMHSA
    NHSDA 2.2
NCES
    NHES(4) 1.4
    ECLS-B 1.2
    ECLS-K(5) 2.5
1 Most of the surveys are based on household samples. The design effects apply to statistics that do not cluster strongly within households, e.g., health conditions, educational attainment, and labor force status. Items such as poverty status, availability of health insurance, and urban-rural residence generally are identical for all members of a household, and the design effects for such items are much larger, usually two to three times those shown in the table.
2 The design effects are approximately the same for the March CPS and for other months.
3 The design effects are those of statistics on data for the total of each race/ethnic group. Design effects for individual age-sex groups are lower.
4 The design effects shown apply to statistics on children who constitute the main focus of NHES. Data for adults are sometimes included in the survey, and they are subject to higher design effects.
5 The design effect shown, 2.5, applies to most social, economic, and related items. The design effect for test scores is about 5.
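
To make the role of a design effect concrete, the following Python sketch shows how it inflates the standard error of an estimated proportion relative to a simple random sample of the same size. The function name, the 30 percent proportion, and the sample size of 1,000 are illustrative assumptions, not values from this report; only the design effects of 2.5 and about 5 are taken from footnote 5. The textbook approximation used here (variance multiplied by the design effect) is likewise an assumption about how the table would be applied.

```python
import math

def se_proportion(p, n, deff=1.0):
    """Approximate standard error of a proportion p based on n sample cases.

    deff is the design effect: the ratio of the variance under the actual
    sample design to the variance of a simple random sample of the same size.
    """
    return math.sqrt(deff * p * (1.0 - p) / n)

# Hypothetical inputs: a 30 percent proportion estimated from 1,000 cases.
p, n = 0.30, 1000

se_srs = se_proportion(p, n)             # simple random sample (deff = 1)
se_25 = se_proportion(p, n, deff=2.5)    # design effect of 2.5
se_50 = se_proportion(p, n, deff=5.0)    # design effect of about 5

for label, se in [("deff 1.0", se_srs), ("deff 2.5", se_25), ("deff 5.0", se_50)]:
    print(f"{label}: SE = {se:.4f}, ratio to SRS = {se / se_srs:.2f}")
```

Under this approximation the standard error grows with the square root of the design effect, so the ratio to simple random sampling does not depend on the particular proportion or sample size chosen.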

Many of the U.S. Government surveys are repetitive, that is, carried out every year, conducted several times a year, or, as in the case of CPS, conducted every month. In most cases, the sample sizes shown in this report describe the annual sample as it was in the time period noted. The reader should be aware that sample sizes are sometimes changed because of budgetary restrictions or other causes. Before analyzing a data set, it is useful to ascertain whether the sample design differs in any important way between the time period analyzed and the reference date shown in Section 1.2. If so, the sample sizes should be modified accordingly. There are a few cases in which there may be some ambiguity in the sample size. A brief discussion of these cases follows:

  • CPS. The CPS is carried out monthly, primarily to obtain labor force information. In March of each year, the CPS becomes a mini-census, including such information as income, mobility, family and household composition, and related data. Supplementary items are also covered in other months, such as school enrollment, children ever born, and voting (in alternate years). The sample sizes are virtually identical in 11 of the 12 months. In March, the number of Hispanics in the sample is doubled, with non-Hispanics kept the same as in other months.

    The sample size in each of the 11 months (excluding March) is referred to as the CPS-Monthly sample. The March sample is similarly referred to as CPS-March. In analyzing a CPS data set for Hispanic subgroups, it therefore is important to identify the month in which the information was obtained. The March sample sizes for Asian and Pacific Islanders and American Indians or Alaska Natives are the same as in other months, so that month of data collection does not affect the sample size.

    Since labor force data are collected each month in CPS, it is possible to obtain yearly averages by pooling the data sets for all 12 months of a year. However, the CPS sample retains the same sample units in a 4-month cycle, and there is about a 75 percent overlap in the sample from one month to the next. The effectiveness of the annual sample is thus very much less than 12 times the monthly sample. Section 4.5 of this report discusses the sampling errors of annual averages in CPS.

  • SIPP. The terms "panels" and "waves" are used to describe the SIPP sample. Panels refer to the set of households that comprise a probability sample of the total population and also of subpopulations. Waves refer to the interview cycles; each panel is interviewed several times a year (i.e., several waves) and over the course of a number of consecutive years. The waves reflect the fact that SIPP is mainly viewed as a source of longitudinal data. A panel's sample size is intended to stay the same over the course of the interview waves, although there is normally some attrition resulting from cumulative nonresponse. This moderate attrition does not change the sample sizes sufficiently to affect any conclusions in this report.

    Currently, a single panel is used. The current panel was introduced in 1996 and will continue through 1999. The sample sizes for SIPP shown in this report are those of the current panel. In earlier years, a rotating panel structure was used, with several panels operating in each year. Before proceeding with a study based on SIPP, an analyst should check the number and size of panels used in the time period of interest. It should also be noted that, because the current panel does not rotate, there is very little to be gained by combining years (except, of course, for longitudinal analyses of changes over time).

  • NHANES. Currently, each year, a new sample of about 5,000 individuals of all ages (including about 1,500 Mexican-Americans, 82 "Other Hispanics," 113 Asian and Pacific Islanders, and 24 American Indians or Alaska Natives) is interviewed and examined. The samples among years are independent; consequently, the results can be aggregated across years to improve reliability. Most of the past analyses of NHANES have concentrated on the health and nutrition status of detailed age-sex-race/ethnicity groups, and a 6-year accumulation of data is necessary to meet the established precision requirements for these detailed age groups. Shorter periods can be used for broader age groups, and NHANES III used both a 3-year and a 6-year accumulation. The use of independent samples each year in the current NHANES permits considerable latitude for the analyst in combining years, and it is expected that most analyses will use several years.

    The sample sizes shown in this report refer to a single year's sample. Section 4.4 discusses the effect of combining years of data.

  • MEPS and MCBS. Each year's sample in these two surveys is interviewed several times during the course of the year. The purpose of the multiple visits is to shorten the time period for which information is obtained and, thus, reduce the possibility that memory factors will affect the quality of data. The data tapes combine data obtained in the multiple visits, and the sample sizes shown refer to the number of persons for whom annual data are obtained, not the number of interviews.

    Currently, each year's MEPS sample consists of two panels: one introduced for the first time that year, and the second carried over from the preceding year. For this reason, it is important that sample sizes be verified before attempting to utilize these data. The MEPS sample sizes shown in this report refer both to the new panel introduced in 1999 and the panel carried over from 1998. The MCBS also has a panel structure, and we report the sample size for the four panels included in early 1998.

  • NVS. All births and deaths are covered in the NVS Mortality and Natality data sets. As indicated in Section 2.1, this report focuses on the uses of the data for descriptive analyses. From this viewpoint, vital statistics are not subject to sampling error. Consequently, the two NVS data sets are not included in either the tables on effective sample sizes or the subsequent discussion of available precision.
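
The pooling remarks in the CPS and NHANES items above can be illustrated with a short Python sketch. It contrasts independent annual samples, whose sample cases simply add up, with overlapping monthly samples, whose average is worth far fewer than 12 independent months. Everything beyond the 75 percent month-to-month overlap, the 4-month cycle, and the 1,500 Mexican-American cases per year is an assumption made for illustration: the function names, the overlaps of 50 and 25 percent at lags of 2 and 3 months (inferred from a 4-month stay in sample), the neglect of any later return to the sample, and the hypothetical correlation of 0.8 between the same unit's responses in different months. It is not the method of Section 4.5.

```python
def pooled_independent(n_per_year, years):
    """Nominal cases when independent annual samples are combined."""
    return n_per_year * years

def effective_months(n_months=12, overlaps=(0.75, 0.50, 0.25), unit_corr=0.8):
    """Number of independent monthly samples that an average over
    n_months overlapping samples is roughly equivalent to.

    overlaps[k-1] is the assumed share of sample units common to two
    months that are k months apart; unit_corr is a hypothetical
    correlation between one unit's responses in two different months.
    The correlation between two monthly estimates is modeled as
    overlap * unit_corr (a simplification).
    """
    total_corr = 0.0
    for i in range(n_months):
        for j in range(n_months):
            lag = abs(i - j)
            if lag == 0:
                total_corr += 1.0
            elif lag <= len(overlaps):
                total_corr += overlaps[lag - 1] * unit_corr
    # Var(mean) = sigma^2 * total_corr / n_months^2, compared with
    # sigma^2 / m for m independent months.
    return n_months ** 2 / total_corr

# NHANES: about 1,500 Mexican-American cases per year, six independent years.
nhanes_six_years = pooled_independent(1500, 6)

# CPS: twelve overlapping months under the assumptions stated above.
cps_equivalent_months = effective_months()

print(nhanes_six_years)
print(round(cps_equivalent_months, 1))
```

Setting `unit_corr` to 0 recovers the independent case (12 equivalent months), which makes it easy to see how much of the loss is driven by the assumed correlation rather than by the overlap alone.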

 

Table 3-3.
Approximations of Hispanic sample cases in the data set
Data set | Total Hispanic | Mexican-Americans | Puerto Ricans | Cubans | Central or South American | Other Hispanic
Census
   Census 2000(1) 4,508,000 2,850,000 475,000 190,000 650,000 335,000
   ACS 900,000 570,000 95,000 38,000 130,000 67,000
   CPS-March 11,260 6,940 1,190 470 1,685 975
   CPS-Monthly 5,635 3,470 595 235 845 490
   SIPP 10,845 7,181 1,172 372 1,306 814
NCHS/CDC
   NHIS 22,145 13,869 2,353 1,165 2,093 4,758
   NSFG 2,097 1,330 221 88 302 156
   NIS 4,852 3,529 398 99 526 300
   NHANES 1,582 1,500 24 10 32 16
AHRQ
   MEPS 5,375 3,650 600 225 766 134
HCFA
   MCBS 464 254 42 67 52 50
SAMHSA
   NHSDA 5,000 3,170 527 211 721 372
NCES
   NHES 18,804 13,675 1,541 385 2,040 1,162
   ECLS-B 1,979 1,367 160 35 137 280
   ECLS-K 2,957 2,150 242 61 321 183
1 Long form data

NOTE:
The sample cases for each data set reflect the population coverage of the respective surveys. For example, CPS-March covers all persons in the civilian noninstitutional population, whereas NSFG covers women 15 to 44 years of age. The Task 2 descriptions of the respective data sets note the appropriate population coverage. The sample sizes are the number of sample persons in each subgroup, including those that are not identified in the data file.

 

Table 3-4.
Approximations of Asian and Pacific Islander sample cases in the data set
Data set | Total Asian and PI | Chinese | Filipinos | Japanese | Asian Indian | Korean | Vietnamese | Hawaiian | Other
Census
    Census 2000(1) 1,580,000 375,000 300,000 180,000 175,000 175,000 135,000 45,000 195,000
    ACS 316,000 75,000 60,000 36,000 35,000 35,000 27,000 9,000 39,000
    CPS-March 4,555 995 850 515 495 485 375 125 565
    CPS-Monthly 4,555 995 850 515 495 485 375 125 565
    SIPP 3,293 745 637 386 370 362 280 95 421
NCHS/CDC
    NHIS 3,284 755 647 356 320 342 356 112 396
    NSFG 327 74 63 38 37 36 28 9 42
    NIS 1,172 265 227 137 131 129 100 33 150
    NHANES 113 27 22 13 12 12 10 3 14
AHRQ
    MEPS 750 152 170 62 111 96 45 17 97
HCFA
    MCBS 151 34 29 18 17 17 13 4 19
SAMHSA
    NHSDA 700 158 135 82 78 77 59 20 90
NCES
    NHES 4,420 999 855 517 495 486 376 128 566
    ECLS-B 2,483 705 467 134 282 278 217 74 325
    ECLS-K 1,870 423 362 219 209 206 159 54 239
1 Long form data

NOTE:
The sample cases for each data set reflect the population coverage of the respective surveys. For example, CPS-March covers all persons in the civilian noninstitutional population, whereas NSFG covers women 15 to 44 years of age. The Task 2 descriptions of the respective data sets note the appropriate population coverage. The sample sizes are the number of sample persons in each subgroup, including those that are not identified in the data file.

 

Table 3-5.
Approximations of American Indian or Alaska Native sample cases in the data set
Data set | American Indian and Alaska Native
Census
    Census 2000(1) 330,000
    ACS 67,000
    CPS-March 1,600
    CPS-Monthly 1,600
    SIPP 1,200
NCHS/CDC
    NHIS 978
    NSFG 77
    NIS 460
    NHANES 24
AHRQ
    MEPS 375
HCFA  
    MCBS 25
SAMHSA
    NHSDA 166
NCES
    NHES 1,675
    ECLS-B 50
    ECLS-K 364
1 Long form data

NOTE:
The sample cases for each data set reflect the population coverage of the respective surveys. For example, CPS-March covers all persons in the civilian noninstitutional population, whereas NSFG covers women 15 to 44 years of age. The Task 2 descriptions of the respective data sets note the appropriate population coverage. The sample sizes are the number of sample persons in each subgroup, including those that are not identified in the data file.

 

The effective sample sizes are simply the nominal sample sizes divided by the design effects. They are shown in Tables 3-6 to 3-8. The effective sample sizes will be used to identify data sets that satisfy minimum standards of reliability.
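
As a minimal sketch of this calculation, the Python snippet below divides a few of the nominal "Total Hispanic" sample sizes from Table 3-3 by the corresponding design effects from Table 3-2. The function and variable names are illustrative choices, and the selection of surveys is arbitrary; the numbers themselves are copied from the two tables.

```python
def effective_sample_size(nominal, deff):
    """Effective sample size: nominal sample size divided by the design effect."""
    return nominal / deff

# survey: (nominal Total Hispanic cases from Table 3-3, design effect from Table 3-2)
hispanic_totals = {
    "CPS-March": (11260, 1.5),
    "CPS-Monthly": (5635, 1.5),
    "SIPP": (10845, 2.4),
    "NSFG": (2097, 1.7),
    "NHSDA": (5000, 2.2),
    "ECLS-K": (2957, 2.5),
}

effective = {
    survey: effective_sample_size(nominal, deff)
    for survey, (nominal, deff) in hispanic_totals.items()
}

for survey, n_eff in effective.items():
    print(f"{survey:<12} {n_eff:>8,.0f}")
```

Rounded to whole cases, these results agree with the "Total Hispanic" column of Table 3-6 for the same surveys.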

 

Table 3-6.
Effective sample sizes for Hispanics
Data set | Total Hispanic | Mexican-American | Puerto Rican | Cuban | Central or South American | Other Hispanic
Census
    Census 2000(1) 4,508,000 2,850,000 475,000 190,000 650,000 335,000
    ACS 900,000 570,000 95,000 38,000 130,000 67,000
    CPS-March 7,507 4,627 793 313 1,123 650
    CPS-Monthly 3,757 2,313 397 157 563 327
    SIPP 4,519 2,992 488 155 544 339
NCHS/CDC
    NHIS 14,763 9,246 1,569 777 2,093 1,079
    NSFG 1,234 782 130 52 178 92
    NIS 3,732 2,715 306 76 405 231
    NHANES 727 682 12 6 18 9
AHRQ
    MEPS 4,479 3,042 500 188 637 112
HCFA
    MCBS 422 231 38 61 47 45
SAMHSA
    NHSDA 2,273 1,441 240 96 328 169
NCES
    NHES 13,431 9,768 1,101 275 1,457 830
    ECLS-B 1,649 1,139 133 29 114 233
    ECLS-K 1,183 860 97 24 128 73
1 Long form data

NOTE:
The sample cases for each data set reflect the population coverage of the respective surveys. For example, CPS-March covers all persons in the civilian noninstitutional population, whereas NSFG covers women 15 to 44 years of age. The Task 2 descriptions of the respective data sets note the appropriate population coverage. The sample sizes are the number of sample persons in each subgroup, including those that are not identified in the data file.

 

Table 3-7.
Effective sample sizes for API
Data set | Total Asian and PI | Chinese | Filipinos | Japanese | Asian Indian | Korean | Vietnamese | Hawaiian | Other
Census
    Census 2000(1) 1,580,000 375,000 300,000 180,000 175,000 175,000 135,000 45,000 195,000
    ACS 316,000 75,000 60,000 36,000 35,000 35,000 27,000 9,000 39,000
    CPS-March 3,037 663 567 343 330 323 250 83 377
    CPS-Monthly 3,037 663 567 343 330 323 250 83 377
    SIPP 2,058 466 398 241 231 226 175 59 263
NCHS/CDC
    NHIS 2,433 559 479 264 237 253 264 83 293
    NSFG 234 53 45 27 26 26 20 6 30
    NIS 902 204 175 105 101 99 77 26 115
    NHANES 63 15 12 7 7 7 5 2 8
AHRQ
    MEPS 357 72 81 30 53 46 21 8 46
HCFA
    MCBS 137 31 26 16 15 15 12 4 17
SAMHSA
    NHSDA 318 72 61 37 35 35 27 9 41
NCES
    NHES 3,157 714 611 369 354 347 269 91 404
    ECLS-B 2,069 588 389 112 235 232 181 62 271
    ECLS-K 748 169 145 88 84 82 64 22 96
1 Long form data

NOTE:
The sample cases for each data set reflect the population coverage of the respective surveys. For example, CPS-March covers all persons in the civilian noninstitutional population, whereas NSFG covers women 15 to 44 years of age. The Task 2 descriptions of the respective data sets note the appropriate population coverage. The sample sizes are the number of sample persons in each subgroup, including those that are not identified in the data file.

 

Table 3-8.
Effective sample sizes for American Indians or Alaska Natives
Data set | Effective sample size
Census
    Census 2000(1) 330,000
    ACS 67,000
    CPS-March 1,067
    CPS-Monthly 1,067
    SIPP 1,000
NCHS/CDC
    NHIS 652
    NSFG 45
    NIS 354
    NHANES 12
AHRQ
    MEPS 179
HCFA
    MCBS 23
SAMHSA
    NHSDA 75
NCES
    NHES 1,196
    ECLS-B 148
    ECLS-K 146
1 Long form data

NOTE:
The sample cases for each data set reflect the population coverage of the respective surveys. For example, CPS-March covers all persons in the civilian noninstitutional population, whereas NSFG covers women 15 to 44 years of age. The Task 2 descriptions of the respective data sets note the appropriate population coverage. The sample sizes are the number of sample persons in each subgroup, including those that are not identified in the data file.
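
As a rough illustration of how effective sample sizes might be screened against a minimum standard of reliability, the Python sketch below filters the Table 3-8 values for American Indians or Alaska Natives. The threshold of 400 effective cases and the function name are hypothetical and are not taken from this report; an actual standard would depend on the statistic and precision required.

```python
# Effective sample sizes for American Indians or Alaska Natives (Table 3-8).
effective_aian = {
    "Census 2000": 330000,
    "ACS": 67000,
    "CPS-March": 1067,
    "CPS-Monthly": 1067,
    "SIPP": 1000,
    "NHIS": 652,
    "NSFG": 45,
    "NIS": 354,
    "NHANES": 12,
    "MEPS": 179,
    "MCBS": 23,
    "NHSDA": 75,
    "NHES": 1196,
    "ECLS-B": 148,
    "ECLS-K": 146,
}

# Hypothetical minimum; not a standard stated in the report.
MIN_EFFECTIVE_N = 400

def meets_minimum(sizes, minimum):
    """Return the data sets whose effective sample size is at least `minimum`."""
    return sorted(name for name, n_eff in sizes.items() if n_eff >= minimum)

adequate = meets_minimum(effective_aian, MIN_EFFECTIVE_N)
print(adequate)
```

Changing `MIN_EFFECTIVE_N` shows how sensitive the list of usable data sets is to the standard chosen, which is the practical reason for working with effective rather than nominal sample sizes.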