Extending the Utility of Federal Data Bases. Surveys and Race/Ethnicity Groups Meeting Standards for Precision

05/01/2000

A comparison of the effective sample sizes in Tables 3-6 to 3-8 with the numbers needed to meet alternate levels of precision shown in Table 3-1 indicates which race/ethnic subgroups meet these standards for each of the surveys.

We should like to reiterate the caveats mentioned earlier in the discussion of these standards. The sample sizes in Table 3-1 will provide the coefficient of variation for the indicated estimate of prevalence of the total population in the race/ethnic subgroup (or of the total target population of the survey; e.g., females 15-44 for NSFG, persons 65 or older for MCBS, etc.). If the contemplated analysis includes examining subsets of the total, such as individual age groups, urban-rural residence, or low-income vs. higher-income persons, much larger sample sizes are needed; essentially each subset would require approximately the sample sizes shown in Table 3-1. Since the specific studies to be carried out have not yet been developed, this report does not contain a provision for subset analysis, but the possibility of the need for such statistical breakdowns and their implications should be kept in mind.

Most of the surveys use an identical sampling rate for all persons in each race/ethnic group. In these surveys, the sample size for any subset can be estimated by taking the proportion of the sample equal to the proportion of the relevant population in that subset. For example, for analysis of data by gender, the male (and female) sample will be equal to about one-half the total sample. Similarly, for an age group containing about 20 percent of the relevant population, the sample will be 20 percent of the total sample in the race/ethnic subgroup. Similar relationships hold for other subsets, such as regional breakdowns, income classes, etc. For subset analyses, the nominal and effective sample sizes in the tables that follow should be adjusted to reflect the portion of the subgroup to be analyzed.
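As a rough illustration of the proportional adjustment described above, the following Python sketch scales a subgroup's sample size by the share of the population falling in the subset. The function name and the sample size of 2,000 are hypothetical and are not taken from the tables in this report; the calculation assumes a common sampling rate within the subgroup.

```python
def subset_sample_size(subgroup_sample: float, population_share: float) -> float:
    """Estimate the sample size for a subset of a race/ethnic subgroup.

    Assumes every person in the subgroup is sampled at the same rate, so the
    subset's share of the sample equals its share of the population.
    """
    if not 0.0 <= population_share <= 1.0:
        raise ValueError("population_share must be between 0 and 1")
    return subgroup_sample * population_share


# Hypothetical sample of 2,000 persons in one race/ethnic subgroup.
total = 2000
males = subset_sample_size(total, 0.5)      # about one-half of the sample
age_group = subset_sample_size(total, 0.2)  # age group holding 20 percent of the population
print(males, age_group)
```

This shortcut does not apply to surveys that sample some subsets at different rates.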

There are a few exceptions to the use of a common sampling rate for all members of a subgroup. NHANES focuses on 52 age-sex-race/ethnicity subsets, and uses approximately the same sample sizes for each. The 52 groups are described in several reports on the methodology of NHANES, and analysts concerned with subsets of the race/ethnicity subgroups should refer to the NHANES publications for appropriate methods of estimating the sample sizes. SIPP oversamples persons in poverty. For subset analyses comprising persons in poverty (or items correlated with poverty), the analyst should obtain a description of the current SIPP sample and use it to estimate the sample size.

Secondly, the design effects in Table 3-2 that were inputs to the calculation of the effective sample sizes basically apply to data that are not heavily clustered within households. Examples of statistics that are not clustered, or only moderately clustered, are: smoking status, presence of specific chronic illnesses such as hypertension or arthritis, occupation, and very large expenditures for medical care during the year. For such items, members of a household are unlikely to have the same characteristics. On the other hand, as is indicated in footnote 1 of Table 3-2, items such as poverty status, health insurance, urban-rural residence, etc. tend to be identical for all members in a household, and the design effects are usually two to three times as large as those in Table 3-2. Other examples of items with high clustering effects are: mobility status, whether or not foreign born, and income class. Such items will tend to be identically reported within a household so that obtaining the statistics from all members of a household is no more useful than an interview with only one household member. In such instances, the design effect is increased by a factor equal to the average household size, that is, by a factor of about 3.5 for Asian and Pacific Islanders, 4.3 for American Indians and Alaska Natives, and 3.6 for Hispanics. The average household size (and consequently the design effects) can differ among the subgroups that are the focus of this report. For example, the average household size for Hispanic subgroups varies from a low of 2.6 for Cubans to 3.9 for Mexicans. An analyst should check the household sizes of the subgroups to be studied if highly clustered items are important variables, and modify the design effects accordingly. An alternate way of accomplishing the same goal for highly clustered items is to treat the sample size as the number of households in the sample rather than the number of persons. The nominal and effective sample sizes in the various tables should then be divided by the average household size.
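The clustering adjustment can be sketched in Python as follows. This is an illustration only: it assumes the conventional definition of effective sample size (nominal sample size divided by the design effect), and the nominal sample of 3,600 and the design effect of 1.5 are hypothetical values, not entries from Table 3-2. The household sizes are the averages quoted in the text.

```python
# Average household sizes quoted in the text.
AVG_HOUSEHOLD_SIZE = {
    "Asian and Pacific Islander": 3.5,
    "American Indian and Alaska Native": 4.3,
    "Hispanic": 3.6,
}


def effective_sample_size(nominal: float, design_effect: float) -> float:
    """Conventional definition (assumed here): nominal size / design effect."""
    if design_effect <= 0:
        raise ValueError("design_effect must be positive")
    return nominal / design_effect


def clustered_effective_sample_size(
    nominal: float, design_effect: float, avg_household_size: float
) -> float:
    """Effective size for items that are nearly identical within a household.

    The design effect is inflated by the average household size, which is
    the same as dividing the effective sample size by that household size.
    """
    return effective_sample_size(nominal, design_effect * avg_household_size)


# Hypothetical inputs: 3,600 Hispanic sample persons, design effect of 1.5.
nominal, deff = 3600, 1.5
print(effective_sample_size(nominal, deff))
print(clustered_effective_sample_size(nominal, deff, AVG_HOUSEHOLD_SIZE["Hispanic"]))
```

For a specific subgroup such as Cubans or Mexicans, the household size passed to the second function should be the one for that subgroup.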

Results of the comparisons of Tables 3-6 to 3-8 with Table 3-1 are summarized below. Table 3-1 indicates the sample size cut-offs for various levels of confidence in the data. Thus, an effective sample size of 500 satisfies requirements for a 20 percent CV for all prevalence rates except very rare ones (i.e., p = .01); an effective sample of 1,000 will provide a CV of .10 on prevalence rates greater than or equal to .10, as well as satisfying the criteria mentioned for a sample of 500; and a sample size of about 2,000 to 2,500 will produce CVs of .20 or better for prevalence rates as low as .01.
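Table 3-1 is not reproduced in this section, so the following Python sketch rests on an assumption: that the cut-offs follow the usual approximation for the coefficient of variation of an estimated proportion under simple random sampling, CV = sqrt((1 - p) / (n p)), with n taken as the effective sample size. Under that assumption the formula agrees with the cut-offs quoted above: 500 gives a CV just under .20 at p = .05, 1,000 gives a CV just under .10 at p = .10, and a CV of .20 at p = .01 requires about 2,475. The function names are illustrative.

```python
import math


def cv_of_prevalence(p: float, n_eff: float) -> float:
    """Approximate CV of an estimated prevalence rate p.

    Assumes simple random sampling with effective sample size n_eff.
    """
    if not 0.0 < p < 1.0 or n_eff <= 0:
        raise ValueError("need 0 < p < 1 and n_eff > 0")
    return math.sqrt((1.0 - p) / (n_eff * p))


def required_effective_sample(p: float, target_cv: float) -> float:
    """Effective sample size needed to reach target_cv at prevalence p."""
    if not 0.0 < p < 1.0 or target_cv <= 0:
        raise ValueError("need 0 < p < 1 and target_cv > 0")
    return (1.0 - p) / (p * target_cv ** 2)


# Effective sample sizes and prevalence rates discussed in the text.
for n_eff, p in [(500, 0.05), (500, 0.01), (1000, 0.10), (2500, 0.01)]:
    print(f"n_eff={n_eff:>5}, p={p:.2f}: CV = {cv_of_prevalence(p, n_eff):.3f}")

print(round(required_effective_sample(0.01, 0.20)))
```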

  • Census 2000/ACS. It is clear that the Census 2000 and the ACS samples are sufficiently large to satisfy any reasonable precision requirement.
  • CPS. Both the March CPS, which oversamples Hispanics by a factor of 2, and the monthly CPS satisfy virtually all requirements for Mexican-Americans except the most stringent one, i.e., the CV of .10 on a prevalence rate of .01. (The monthly sample is a little short of what is needed for a CV of .20 on a rate of .01, but the difference is negligible.)

    For prevalence levels of .05 or greater, the March sample of Central or South Americans satisfies almost all of the requirements. The March samples of Puerto Ricans and "other" Hispanics are satisfactory for prevalence rates of .05 or greater when CVs of .20 or more are required, and for rates of .15 or greater when a CV of .10 is needed.

    Other than for March, the monthly CPS samples for all Hispanic groups except Mexican-Americans and Central or South Americans are fairly small and only provide the sample needed for CVs of .20 or greater with rates of .05 or more. This is also true of the March Cuban sample; the monthly Cuban sample produces even less reliability.

    The CPS American Indian or Alaska Native sample is sufficient for CVs of .20 or greater with prevalence rates of .05 or greater. It is also large enough to provide a CV of .10 when the prevalence rates are .10 or greater.

    The Chinese and Filipino samples are inadequate when the rate is as low as .01, but will provide CVs of .20 for rates .05 or greater, and CVs of .10 for rates above .15. The Japanese, Asian-Indian, Korean and "other" API samples are quite similar and are mostly sufficient to provide CVs of .20 for rates of .10 or greater and a CV of .30 for a .05 rate. The Hawaiian sample is quite small and satisfies hardly any of the requirements.

  • SIPP. The Wave 1, 1996 Mexican-American sample meets all the precision requirements except the most stringent one, that is, a 10 percent CV on a .01 prevalence rate. The Puerto Rican and Central and South American samples will achieve a CV of .20 or better for prevalence rates of .05 or greater. Estimates for Cubans and other Hispanics will fulfill only rather modest requirements. The Chinese, Filipino and American Indian or Alaska Native samples satisfy moderate requirements, but the Hawaiian sample is quite small and can only satisfy the most generous requirements. The effectiveness of the sample may weaken somewhat for successive interviewing waves, as cumulative nonresponse affects the sample size.
  • NHIS. As a result of the oversampling of Hispanics, the annual Mexican-American sample fulfills the requirements for all CV and prevalence rates, except for a CV of .10 on a .01 prevalence rate, and it comes close to meeting that goal.

    The Central and South American annual sample satisfies the criteria for precision for prevalence rates of .05 or greater. The Puerto Rican and "other" Hispanic annual samples meet all the requirements when the prevalence rate is .05 or greater, except the goal of a CV of .10 for a prevalence rate of .05. The smaller Cuban sample is still large enough to obtain a CV of .20 or better for prevalence rates of .05 or greater, and to provide a CV of .10 for prevalence rates of .15 or more.

    The annual American Indian or Alaska Native sample is close to that of Cubans, and will achieve approximately the same levels of precision.

    The Chinese and Filipino samples are a little smaller than the American Indian or Alaska Native sample, but they still will satisfy similar goals, that is, they will provide CVs of .20 or better for prevalence rates of .05 or greater. The other Asian and Pacific Islander groups will only meet the most modest criteria, a .30 CV for rates of .05 or greater and a .20 CV when the rate is .10 or more.

  • NSFG. Mexican-Americans comprise the only population subgroup that can provide reasonable precision: a CV of .20 for prevalence rates of .05 or more and a CV of .10 for rates of .15 or more. All of the other subgroups could satisfy only very minimal standards.
  • NIS. The Mexican-American sample satisfies all precision requirements except for a CV of .10 on a .01 prevalence rate. None of the other race/ethnic subgroups do very well. The Puerto Rican, Central and South American, and American Indian or Alaska Native samples can meet moderate standards: a .20 CV on prevalence rates of .10 or greater and a .30 CV on a prevalence rate of .05. The other race/ethnic subgroups could provide only crude estimates.
  • NHANES. The NHANES sample was designed specifically to provide good reliability for Mexican-Americans, but only when several years of data are combined. The annual sample size is fairly modest and will only provide a CV of .20 for prevalence rates equal to or greater than .20. A CV of .10 is achieved for prevalence rates equal to or greater than .15. None of the other race/ethnic groups can provide usable annual data.
  • MEPS. The Mexican-American sample satisfies all of the precision requirements for prevalence rates of .05 and greater and even does fairly well with rates of .01. Some modest analysis is possible for Puerto Ricans and Central or South Americans. The samples on the other race/ethnic subgroups are too small to be useful.
  • MCBS. The sample sizes are too small for subgroup analyses, even for Mexican-Americans.
  • NHSDA. Mexican-Americans could provide a CV of .10 for prevalence rates of .10 or more and CVs of .20 for rates of .05. Some limited analysis of Puerto Ricans and Central or South Americans is possible. None of the other subgroups would provide useful data.
  • NHES. The Mexican-American sample meets, or comes very close to meeting, all of the precision requirements. The Puerto Rican, Central or South American, and American Indian or Alaska Native samples are reasonably large, and would produce a CV of .20 for prevalence rates of .05, and a CV of .10 for rates of .10 or greater. The Chinese and Filipino samples are large enough for a CV of .20 with a prevalence rate of .05 or greater. Only limited use of the other population subgroups is possible.
  • ECLS-B. Using the sample sizes at the initial interview, Mexican-Americans will provide a CV of .20 or better for prevalence rates of .05 or more, and a CV of .10 or better on rates of .10 or greater. The Chinese sample will achieve CVs of .20 or better on rates of .05 or greater. The other subgroups would satisfy only very minimal requirements.
  • ECLS-K. Subgroup analysis of the results of the Year 1 interview essentially would have to be restricted to Mexican-Americans. Their estimates of prevalence rates of .10 or greater would be subject to a CV no greater than .10, and prevalence rates of .05 would have a CV of .20.

The analysis above can be summarized as follows. The vital statistics records, Census 2000 and the ACS will permit detailed and complex analyses of all race/ethnic subpopulations. The March CPS, the NHIS, and NHES can produce quite accurate statistics for Mexican-Americans, moderately good data for Puerto Ricans and Central or South Americans, and acceptable data for the other Hispanic subgroups, with the possible exception of Cubans. Data for Chinese, Filipinos, and American Indians or Alaska Natives would be fairly reliable. Only limited analysis could be made of data for the remaining API subgroups. The monthly CPS and SIPP would be weaker for Hispanics, but mostly still acceptable. For the other surveys, acceptable precision is only possible for Mexican-Americans, and MCBS would not even be acceptable for that subgroup.

It is important to remember that the above analyses apply to the ability of the surveys to provide acceptable accuracy on prevalence rates (or percentage distributions) of total persons in each subpopulation. Many surveys require examination of important subsets of the population, as well as the total. For example, NHANES concentrates on age-sex-race/ethnicity subgroups, MEPS examines low-income persons as well as the total population, and an analytic group in the NSFG is teenagers, by race/ethnicity. For such analyses, each subset needs the sample sizes shown in Table 3-1. Thus, a simple four-way breakdown of the population, such as persons under or over 25 years by sex, would require a sample four times as great as the numbers in Table 3-1.
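The multiplication implied by a breakdown into subsets can be sketched in Python as follows. The function name and the per-subset requirement of 1,000 are hypothetical; the sketch assumes the sample is spread across subsets in proportion to their population shares, so the smallest subset determines the total needed. With four equal subsets the result is four times the per-subset requirement, as in the example above.

```python
from typing import Sequence


def required_total_sample(per_subset_sample: float, subset_shares: Sequence[float]) -> float:
    """Total sample needed so that every subset reaches per_subset_sample.

    Assumes proportional allocation, so the subset with the smallest
    population share is the binding constraint.
    """
    if not subset_shares or min(subset_shares) <= 0:
        raise ValueError("subset_shares must be non-empty and positive")
    return per_subset_sample / min(subset_shares)


# Hypothetical requirement of 1,000 effective sample persons per subset.
equal_cells = [0.25, 0.25, 0.25, 0.25]   # e.g., under/over 25 years by sex
unequal_cells = [0.4, 0.3, 0.2, 0.1]     # illustrative unequal breakdown
print(required_total_sample(1000, equal_cells))
print(required_total_sample(1000, unequal_cells))
```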

Table 3-9 contains guidance on the ability of the various databases to provide acceptable precision levels, as follows:

  1. Detailed cross-classification is possible with reasonable precision;
  2. Some limited cross-classification is possible;
  3. Only simple distributions are possible; and
  4. No analysis is possible.

The classifications are subjective, and it is possible to reach different conclusions on the levels of precision that are reasonable. An analyst should determine how much error can be tolerated before reaching a conclusion on the detailed analysis to be carried out. Once again, given the possible changes in sample size or design, as well as the use of overlapping samples, we urge that, prior to using a particular data file, the current sample sizes and design effects be verified.

 

Table 3-9.
Adequacy of databases for provision of data with acceptable precision
(see footnote* for description of codes used)

                Hispanic
Database        Mexican-American  Puerto Rican  Cuban  Central & South American  Other  American Indian or Alaska Native
Census
    Census 2000 A                 A             A      A                         A      A
    ACS         A                 A             A      A                         A      A
    CPS-March   A                 C             C      B                         C      B
    CPS-Monthly B                 C             D      C                         C      B
    SIPP        B                 C             D      C                         C      B
NCHS/CDC
    NHIS        A                 B             C      B                         B      C
    NSFG        C                 D             D      D                         D      D
    NIS         B                 C             D      C                         C      C
    NHANES      C                 D             D      D                         D      D
AHRQ
    MEPS        B                 C             D      C                         C      D
HCFA
    MCBS        C                 D             D      D                         D      D
SAMHSA
    NHSDA       B                 C             D      C                         D      D
NCES
    NHES        A                 B             C      B                         C      B
    ECLS-B      B                 D             D      D                         C      D
    ECLS-K      C                 D             D      D                         D      D

* Code  Level of detail possible with adequate precision  Effective sample sizes
  A     Detailed cross-classification possible            4,000 or more
  B     Some limited cross-classification                 1,000 to 3,999
  C     Only simple distributions                         200 to 999
  D     Analysis not possible                             Under 200

 

Table 3-9. (continued)
Adequacy of databases for provision of data with acceptable precision
(see footnote* for description of codes used)

Data set        Chinese  Filipino  Japanese  Asian Indian  Korean  Vietnamese  Hawaiian  Other
Census
    Census 2000 A        A         A         A             A       A           A         A
    ACS         A        A         A         A             A       A           A         A
    CPS-March   C        C         C         C             C       C           D         C
    CPS-Monthly C        C         C         C             C       C           D         C
    SIPP        C        C         C         C             C       D           D         C
NCHS/CDC
    NHIS        C        C         C         C             C       C           D         C
    NSFG        D        D         D         D             D       D           D         D
    NIS         C        D         D         D             D       D           D         D
    NHANES      D        D         D         D             D       D           D         D
AHRQ
    MEPS        D        D         D         D             D       D           D         D
HCFA
    MCBS        D        D         D         D             D       D           D         D
SAMHSA
    NHSDA       D        D         D         D             D       D           D         D
NCES
    NHES        C        C         C         C             C       C           D         C
    ECLS-B      C        C         D         C             C       D           D         C
    ECLS-K      D        D         D         D             D       D           D         D

* Code  Level of detail possible with adequate precision  Effective sample sizes
  A     Detailed cross-classification possible            4,000 or more
  B     Some limited cross-classification                 1,000 to 3,999
  C     Only simple distributions                         200 to 999
  D     Analysis not possible                             Under 200
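The footnote's mapping from effective sample size to adequacy code can be written as a small Python function. This is a sketch of the thresholds stated in the footnote only; the function name is illustrative, and the sample sizes passed to it below are arbitrary test values rather than figures from Tables 3-6 to 3-8.

```python
def adequacy_code(effective_sample_size: float) -> str:
    """Return the Table 3-9 code (A to D) for an effective sample size."""
    if effective_sample_size >= 4000:
        return "A"  # detailed cross-classification possible
    if effective_sample_size >= 1000:
        return "B"  # some limited cross-classification
    if effective_sample_size >= 200:
        return "C"  # only simple distributions
    return "D"      # analysis not possible


# Arbitrary effective sample sizes, one in each band.
for n_eff in (5200, 1500, 450, 120):
    print(n_eff, adequacy_code(n_eff))
```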

The ability to produce acceptable data also depends on whether the survey collects the detailed race/ethnicity description of each sample person and enters the code in the data set. The Task 2 report indicated a few cases in which not all subpopulations were identified. Many of the surveys simply ask whether the sample person is an Asian or Pacific Islander without obtaining additional detail. The NVS natality and mortality records both identify Chinese, Japanese, Hawaiians, and Filipinos in all 50 states, but identify the other ethnic groups (Vietnamese, Asian Indians, Koreans, Samoans, and Guamanians) in only nine states, which contain about two-thirds of the U.S. population in each of these groups. Obviously, the identifications and coding in the surveys and the NVS would need to be expanded to make tabulations possible.