Assessment of Major Federal Data Sets for Analyses of Hispanic and Asian or Pacific Islander Subgroups and Native Americans: Inventory of Selected Existing Federal Databases. Sources of Information


To the extent possible, information was obtained from the staff of the Government agency responsible for each survey. Westat initially contacted each agency with a structured set of questions, on the understanding that the objective was to obtain the desired information rather than to adhere rigidly to a fixed format of questions. Other sources were used as required to fill any data gaps: published and unpublished descriptions of the sample designs and survey procedures were consulted, as were Westat staff with personal knowledge of the content and procedures of many of the surveys. (Westat conducts some of the surveys under contract; in other cases, Westat helped develop the sample designs; further, some staff members previously held senior positions at the Census Bureau.)

In a number of cases, direct information on the sample sizes for the race/ethnicity subgroups was not available, even though the total sample size was known and, frequently, the numbers of all Hispanics and of all Asian and Pacific Islanders (API) as well. Further, the Census Bureau does not prepare independent current estimates for the Asian and Pacific Islander subpopulations, nor does it publish counts of the number of sample persons or households in each API subpopulation for the current surveys, such as the CPS. The Census Bureau does prepare annual population estimates for Hispanics, Native Americans, and total Asian and Pacific Islanders by updating the most recent Census counts (currently, the 1990 Census) using birth and death records and estimates of net migration, including an allowance for illegal immigration. The most recent estimates are shown in Table 2-1. However, the subpopulations are not included in this program. Similarly, most of the current surveys sponsored by the major statistical agencies do not publish data for the subpopulations. (The CPS does provide a limited amount of data annually for the larger Hispanic subpopulations but not for the Asian and Pacific Islander subgroups.)

The detail shown in Table 2-1 is derived from the March 1999 CPS. The Hispanic subpopulations are estimated directly from the survey; the API subpopulation detail, however, was obtained by applying the percent distributions for the subgroups as reported in the 1990 Census to Census Bureau estimates for March 1999 of the total number of Asian and Pacific Islanders.


Table 2-1. Estimates of U.S. population in the race/ethnic subgroups examined in this report: March 1999
(Population in thousands)

Race/ethnic group                     Total population [1]   Percent of total

Civilian noninstitutional population               271,743              100.0
Hispanics, total                                    31,689               11.7
   Mexican-American                                 20,652                7.6
   Puerto Rican [2]                                  3,039                1.1
   Cuban                                             1,370                0.5
   Central or South American                         4,536                1.7
   Other Hispanic                                    2,091                0.8
Asian or Pacific Islanders, total                   10,492                3.8
   Chinese                                           2,370                0.9
   Filipinos                                         2,028                0.7
   Japanese                                          1,227                0.4
   Asian-Indian                                      1,175                0.4
   Korean                                            1,154                0.4
   Vietnamese                                          892                0.3
   Hawaiian                                            304                0.1
   Other                                             1,342                0.5
American Indian or Alaska Native                     2,396                0.9

1 Data for Hispanics and Hispanic subgroups are from the March 1999 Current Population Survey. Because current estimates for the Asian or Pacific Islander subgroups are not available, 1990 Census detail was adjusted to the March 1999 total for the group to produce an approximate distribution. The estimate for American Indians or Alaska Natives is for July 1999.

2 Does not include persons living in Puerto Rico.

There were several problems in preparing the sample sizes shown in Tables A-1 to A-3 for the current surveys (i.e., all data collection systems covered in this report except the decennial census, ACS, and the vital statistics records). For a number of surveys, there was no way of obtaining exact subpopulation sample sizes since the data records do not contain a subpopulation identifier. In other current surveys, the data records do indicate each sample person's subpopulation identity, but the detail was not tabulated. Requesting special tabulations would have been fairly expensive and would have caused significant delays in the timetable for the project.

However, since it is necessary to have the complete distribution in order to assess the survey's ability to provide reasonably reliable data, and recognizing that reasonable approximations would serve the goals of this study adequately, we have estimated the appropriate subpopulation sample sizes where not available. Tables A-1 through A-3, therefore, contain approximations of the number of sample cases for each subpopulation in a given database. The numbers shown include those published by the Federal agencies responsible for the conduct of the survey, or provided separately, along with a number of derived estimates, prepared for the most part by using the distribution of the population in Table 2-1 as an approximation of the sample distribution. For surveys whose target populations were different from the total population (young children for NIS, ECLS-B, and ECLS-K, females 15 to 44 years for NSFG, and persons 65 years and over for MCBS), the population distribution for the target group, or for a reasonable approximation to this group, was used. Some of the surveys oversample Hispanics, and an allowance for the oversampling is included in the estimates.
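
As an illustration of how such derived estimates might be produced, the following minimal Python sketch spreads a survey's total sample across the subpopulations in proportion to the Table 2-1 population counts. The function name, the 50,000-case total, and the relative sampling rates are hypothetical and do not describe any particular survey. The treatment of oversampling is only one simple way of making such an allowance (the report does not specify the method actually used), and the sketch treats the listed groups as mutually exclusive, which is a simplification.

```python
# Population counts as printed in Table 2-1 (March 1999).
TOTAL_POPULATION = 271_743
SUBGROUP_POPULATION = {
    "Mexican-American": 20_652,
    "Puerto Rican": 3_039,
    "Cuban": 1_370,
    "Central or South American": 4_536,
    "Other Hispanic": 2_091,
    "Chinese": 2_370,
    "Filipino": 2_028,
    "Japanese": 1_227,
    "Asian-Indian": 1_175,
    "Korean": 1_154,
    "Vietnamese": 892,
    "Hawaiian": 304,
    "Other API": 1_342,
    "American Indian or Alaska Native": 2_396,
}


def approximate_sample_sizes(total_sample, relative_rates=None):
    """Approximate subgroup sample sizes from population shares.

    relative_rates (hypothetical) maps a subgroup to its sampling rate
    relative to the rest of the population; 2.0 means the subgroup is
    sampled at twice the base rate. Groups not listed default to 1.0.
    """
    relative_rates = relative_rates or {}
    weights = {
        group: count * relative_rates.get(group, 1.0)
        for group, count in SUBGROUP_POPULATION.items()
    }
    # Everyone outside the listed subgroups, sampled at the base rate.
    weights["All other persons"] = TOTAL_POPULATION - sum(
        SUBGROUP_POPULATION.values()
    )
    # Rescale so the allocations add up to the total sample (before rounding).
    scale = total_sample / sum(weights.values())
    return {group: round(weight * scale) for group, weight in weights.items()}


# Hypothetical survey of 50,000 sample persons with no oversampling.
proportional = approximate_sample_sizes(50_000)

# Same survey, assuming Hispanic subgroups are sampled at twice the base rate.
hispanic_groups = ["Mexican-American", "Puerto Rican", "Cuban",
                   "Central or South American", "Other Hispanic"]
oversampled = approximate_sample_sizes(
    50_000, {group: 2.0 for group in hispanic_groups}
)

for group in ("Cuban", "Hawaiian"):
    print(group, proportional[group], oversampled[group])
```

With no oversampling, each subgroup simply receives its population share of the total sample. With oversampling, the oversampled groups gain cases at the expense of all other groups because the total is held fixed.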

The derived estimates of sample sizes for those databases which do not currently identify all subpopulations also provide an indication of the potential value of the databases as a source of useful information were the appropriate agency to record the race/ethnic detail on the data file.

Although the estimates shown in Tables A-1 through A-3 may differ somewhat from the actual counts and, thus, introduce a degree of approximation in the analyses of these data, we do not believe this will affect in any important way the conclusions to be drawn. The reason for the emphasis on the sample sizes in both this and the Task 3 report is that the sample size is the most important factor in determining the standard errors of a survey's estimates, and the standard errors, in turn, establish the precision that is achieved. The standard error of a proportion estimated in a survey can be expressed as

$$\sigma = \sqrt{\frac{d\,p\,(1-p)}{n}}$$

where p is the proportion being estimated, d is the design effect, which depends on the sample design and the specific item being estimated, and n is the sample size. Because the standard error is inversely proportional to the square root of n, it moves rather slowly with changes in the value of n. For example, if the estimates of sample sizes in Tables A-1 to A-3 were off by 10 percent, this would cause an error of only about 5 percent in the estimates of the standard errors. Similarly, if the sample sizes were actually 20 percent higher or lower than those shown in Tables A-1 to A-3, the estimates of standard error would be in error by only about 10 percent (e.g., ±8.8 percent instead of ±8 percent). We do not believe that this would affect a decision on whether a survey can provide useful data on the subpopulations, or the amount of sample increase required to produce reliable data.
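
The sensitivity described above can be checked numerically. The short Python sketch below evaluates the standard error expression and the relative change in the standard error when the true sample size differs from the assumed one. The function names are illustrative only, and the calculation assumes that d and p stay fixed while n changes.

```python
import math


def standard_error(p, n, d=1.0):
    """Standard error of an estimated proportion: sqrt(d * p * (1 - p) / n)."""
    return math.sqrt(d * p * (1.0 - p) / n)


def relative_change_in_se(n_error):
    """Relative change in the standard error when the true sample size is
    (1 + n_error) times the assumed one, holding d and p fixed."""
    return 1.0 / math.sqrt(1.0 + n_error) - 1.0


for n_error in (-0.20, -0.10, 0.10, 0.20):
    change = relative_change_in_se(n_error)
    print(f"{n_error:+.0%} in n -> {change:+.1%} in standard error")
```

The exact changes are close to the rounded figures quoted in the text: about 5 percent for a 10 percent error in n, and between roughly 9 and 12 percent for a 20 percent error, depending on its direction.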

We also note that standard errors will vary greatly among the statistics estimated in each survey because of the impact of the values of d and p in the expression for the standard error. For example, p may be of the order of 0.10 for the proportion of children without health insurance, but 0.30 for the proportion of children below poverty. Consequently, a decision regarding a survey's ability to provide "adequate reliability" will need to focus on either the reliability for a few specific items or on the average among a group of items. Thus, having reasonable approximations, as distinct from exact standard errors, will not appreciably influence any of the conclusions in this report.
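
To give a sense of how much the standard error can vary between items within a single survey, the following Python sketch evaluates the standard error expression for the two values of p mentioned above. The subgroup sample size of 1,000 and the design effects of 1.0 and 2.0 are hypothetical values chosen only for illustration and are not taken from any survey in this report.

```python
import math


def standard_error(p, n, d=1.0):
    """Standard error of an estimated proportion: sqrt(d * p * (1 - p) / n)."""
    return math.sqrt(d * p * (1.0 - p) / n)


SAMPLE_SIZE = 1_000  # hypothetical subgroup sample size

for p in (0.10, 0.30):        # proportions cited in the text
    for d in (1.0, 2.0):      # hypothetical design effects
        se = standard_error(p, SAMPLE_SIZE, d)
        print(f"p={p:.2f}  d={d:.1f}  SE={se:.4f}  SE relative to p={se / p:.1%}")
```

For a fixed sample size, the absolute standard error is larger for p = 0.30 than for p = 0.10, while the standard error relative to p is smaller, and doubling the design effect multiplies both by the square root of 2. This is one reason a judgment about adequate reliability depends on which items are examined.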