Assessment of Major Federal Data Sets for Analyses of Hispanic and Asian or Pacific Islander Subgroups and Native Americans

Inventory of Selected Existing Federal Databases

Joseph Waksberg
Daniel Levine
David Marker

May 2000

Submitted to:
U.S. Department of Health and Human Services
Office of the Assistant Secretary for Planning and Evaluation

Submitted by:
Westat
1650 Research Boulevard
Rockville, Maryland 20850

This report is available on the Internet at:

Table of Contents

1 Content of Report
2 Sources of Information
3 Particular Issues Relating to Content of the Inventory
  List of Tables
2-1 Estimates of U.S. population in the race/ethnic subgroups examined in this report: March 1999
A-1 Approximations of Hispanic sample cases in the data set
A-2 Approximations of Asian and Pacific Islander sample cases in the data set
A-3 Approximations of American Indian or Alaska Native sample cases in the data set
A-4 Information on citizenship, year of immigration, and foreign birth, by survey
  List of Appendices
A Tables A-1 Through A-4
B Inventory of Selected Federal Databases
C Standards for Maintaining, Collecting, and Presenting Federal Data on Race and Ethnicity

1.  Content of Report

This Task 2 report is the first of the two substantive reports in the study to assess the capability of a number of federal surveys: (1) to provide data on the major subgroups of Hispanics and of Asians or Pacific Islanders (API), and on American Indians or Alaska Natives, in order to analyze the health, education status, and social and economic well-being of these groups; (2) to identify barriers to developing such data; and (3) to identify options for improving the capacity to obtain statistically reliable data about these populations. The report contains information on the applicable sample sizes, and an inventory of existing Federal databases for most of the major demographic, social, economic, and health-related surveys carried out by or for U.S. Government agencies. Most of the databases consist of surveys that are carried out annually, or at other regular intervals, so that they provide reasonably current statistical information. However, two of the databases are somewhat different and do not, strictly speaking, fall into the category of surveys. One is the decennial census; the other is the National Vital Statistics System, which contains data from the birth and death registration systems. These two databases are such important sources of information on demographic characteristics, economic status, and selected health items that it seemed appropriate to include them.

The ability of a survey to provide data on population subgroups with reasonable precision depends on two factors:

  1. The questionnaire or other instruments used for data collection must identify the subgroups and record the information. In turn, the detail also must appear in the microdata file. This is obviously essential, and Appendix B describes both the specific questions on race and ethnicity used in each survey and the detailed race/ethnicity codes that are recorded; and
  2. The sampling errors on estimates of the characteristics of the subgroups need to be low enough for the statistics to be reasonably reliable. The sampling errors are mostly, but not exclusively, dependent on the sample size in each survey. Some of the surveys oversample Hispanics, which reduces sampling errors for the Hispanic subgroups. However, since the surveys operate with fixed budgets, the increased Hispanic samples result in a reduction in sample size for other population groups, which increases the sampling errors for Asian and Pacific Islanders and American Indians. In addition, the survey designs need to be taken into account in considering the appropriate sample sizes. For example, although labor force information is obtained monthly in the Current Population Survey (CPS) conducted by the Bureau of the Census, supplemental items designed to collect a wide variety of other social and economic information are added to individual months during the year. Consequently, only the monthly sample size applies to such information as income, family status, migration, and school enrollment; in the case of the labor force data, on the other hand, information for different months can be combined to increase the sample size and produce quarterly, semiannual, or annual labor force estimates with improved reliability. In another example, the sample design for the National Health and Nutrition Examination Survey (NHANES) is focused on the need to analyze health conditions for rather narrow age-sex groups for Mexican-Americans, blacks, and all other groups. The precision of estimates for the total population is considered of secondary importance. The requirement for approximately equal sample sizes in the various age-sex domains influences the sampling errors for the statistics on the total population, and on data for all Hispanics and all Asians and Pacific Islanders.

The attached tables (Appendix Tables A-1 through A-3) provide detailed information on the sample sizes. The inventory (Appendix B) contains a concise description of the purpose of each survey, the kinds of data obtained, interview methods, and publication policy, as well as the agency website address for those desiring additional detail. Note that the inventory description is limited to what is collected and what is available on the microdata file, since these are most relevant to the assessment. We also have included information on whether and how the subpopulations are identified, and whether bilingual interviewers are used. Table A-4 describes the availability of information on citizenship, year of immigration, and whether foreign born.

Subgroups and Databases Examined

The subgroups of interest are:

  1. Hispanic:
  2. Asian or Pacific Islander: (Note that the new OMB standards (Appendix C) split this category into "Asian" and "Native Hawaiian or Other Pacific Islander.")
  3. American Indian or Alaska Native

The databases examined and the appropriate reference dates are:

Data set                                                    Reference date
Census 2000                                                 April 1, 2000
American Community Survey                                   2003, proposed
Current Population Survey-March                             March 1998
Current Population Survey-Monthly                           Average month, 1998
Survey of Income and Program Participation                  Wave 1, 1996 Panel
National Health Interview Survey                            1998
National Vital Statistics System-Natality                   1997
National Vital Statistics System-Mortality                  1997
National Survey of Family Growth                            1995
National Immunization Survey                                1999
National Health and Nutrition Examination Survey            1999
Medical Expenditure Panel Survey                            1999
Medicare Current Beneficiary Survey                         Early 1998, 4 panels
National Household Survey on Drug Abuse
National Household Education Survey
Early Childhood Longitudinal Survey - Birth Cohort          Year 1, 2000
Early Childhood Longitudinal Survey - Kindergarten Cohort   Year 1, Fall 1998
Since both sample sizes and designs are subject to changes over time as a result of budget actions, congressional or programmatic initiatives, or baseline revisions, it is important that users or interested parties refer to current documentation or inquire of the appropriate agency whether any important changes in sample size or design have been made.

It is important to note that these reports are to serve as a general reference to a potential audience of analysts and policy makers seeking information on the possible uses of these databases as a source of data on race/ethnic groups of interest, rather than as technical handbooks. We would urge users to seek appropriate professional assistance or expertise, either from the relevant agency or from other sources, to deal with specific technical issues.

1  See page 2 of NCHS Report, Sample Design:  Third National Health and Nutrition Examination Survey, Series 2, No. 13, for a more detailed discussion.

2.  Sources of Information

To the extent possible, information was obtained from the staff of the Government agency responsible for each survey. Westat contacted each agency initially with a structured set of questions, but with the understanding that the important objective was to obtain the desired information rather than to adhere rigidly to a fixed format of questions. Other sources were used as required to fill in any data gaps; for example, published or unpublished descriptions of the sample designs and survey procedures were consulted, as were Westat staff with personal knowledge of the content and procedures of many of the surveys. (Westat conducts some of the surveys under contract; in other cases, Westat helped develop the sample designs; further, some staff members previously held senior positions at the Census Bureau.)

We note that in a number of cases direct information on the sample sizes for the race/ethnicity subgroups was not available, even though the total sample size was known and, frequently, the number of all Hispanics and of all Asians and Pacific Islanders as well. Further, the Census Bureau does not prepare independent current estimates for the Asian and Pacific Islander subpopulations, nor does it publish counts of the number of sample persons or households in each API subpopulation for the current surveys, such as the CPS. The Census Bureau does prepare annual population estimates for Hispanics, Native Americans, and total Asian and Pacific Islanders by updating the most recent Census counts (currently, the 1990 Census) with birth and death records and estimates of net migration, including an allowance for illegal immigration. The most recent estimates are shown in Table 2-1. However, the subpopulations are not included in this program. Similarly, most of the current surveys sponsored by the major statistical agencies do not publish data for the subpopulations. (The CPS does provide a limited amount of data annually for the larger Hispanic subpopulations but not for the Asian and Pacific Islander subgroups.)

The detail shown in Table 2-1 is derived from the March 1999 CPS. The Hispanic subpopulations are estimated directly from the survey; the API subpopulation detail, however, was obtained by applying the percent distributions for the subgroups as reported in the 1990 Census to Census Bureau estimates for March 1999 of the total number of Asian and Pacific Islanders.

Table 2-1.
Estimates of U.S. population in the race/ethnic subgroups examined in this report:  March 1999

Race/ethnic group                       Total population(1) (thousands)   Percent of total
Civilian noninstitutional population    271,743                           100.0
Hispanics, total                        31,689                            11.7
   Mexican-American                     20,652                            7.6
   Puerto Rican(2)                      3,039                             1.1
   Cuban                                1,370                             0.5
   Central or South American            4,536                             1.7
   Other Hispanic                       2,091                             0.8
Asian or Pacific Islanders, total       10,492                            3.8
   Chinese                              2,370                             0.9
   Filipinos                            2,028                             0.7
   Japanese                             1,227                             0.4
   Asian-Indian                         1,175                             0.4
   Korean                               1,154                             0.4
   Vietnamese                           892                               0.3
   Hawaiian                             304                               0.1
   Other                                1,342                             0.5
American Indian or Alaska Native        2,396                             0.9

1 Data for Hispanics and Hispanic subgroups are from the March 1999 Current Population Survey. Since current estimates for the Asian or Pacific Islander subgroups are not available, 1990 Census detail was adjusted to the March 1999 total for the group to produce an approximate distribution. The estimate for American Indian or Alaska Natives is for July 1999.

2 Does not include persons living in Puerto Rico.

There were several problems in preparing the sample sizes shown in Tables A-1 to A-3 for the current surveys (i.e., all data collection systems covered in this report except the decennial census, ACS, and the vital statistics records). For a number of surveys, there was no way of obtaining exact subpopulation sample sizes since the data records do not contain a subpopulation identifier. In other current surveys, the data records do indicate each sample person's subpopulation identity, but the detail was not tabulated. Requesting special tabulations would have been fairly expensive and would have caused significant delays in the timetable for the project.

However, since it is necessary to have the complete distribution in order to assess the survey's ability to provide reasonably reliable data, and recognizing that reasonable approximations would serve the goals of this study adequately, we have estimated the appropriate subpopulation sample sizes where not available. Tables A-1 through A-3, therefore, contain approximations of the number of sample cases for each subpopulation in a given database. The numbers shown include those published by the Federal agencies responsible for the conduct of the survey, or provided separately, along with a number of derived estimates, prepared for the most part by using the distribution of the population in Table 2-1 as an approximation of the sample distribution. For surveys whose target populations were different from the total population (young children for NIS, ECLS-B, and ECLS-K, females 15 to 44 years for NSFG, and persons 65 years and over for MCBS), the population distribution for the target group, or for a reasonable approximation to this group, was used. Some of the surveys oversample Hispanics, and an allowance for the oversampling is included in the estimates.
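The kind of approximation described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for Tables A-1 through A-3: the population counts come from Table 2-1, but the survey sample size of 10,000, the oversampling factor of 2, and both function names are hypothetical, and the oversampling adjustment shown is one simple assumption about how such an allowance might be made.

```python
# Population counts in thousands, copied from Table 2-1 (March 1999).
TOTAL_POPULATION = 271_743
HISPANIC_SUBGROUPS = {
    "Mexican-American": 20_652,
    "Puerto Rican": 3_039,
    "Cuban": 1_370,
    "Central or South American": 4_536,
    "Other Hispanic": 2_091,
}


def hispanic_sample_size(total_sample, oversampling_factor=1.0):
    """Approximate number of Hispanic sample cases in a survey.

    ASSUMPTION (not from the report): Hispanics are sampled at
    `oversampling_factor` times the rate used for everyone else,
    and the total sample size is fixed.
    """
    hispanic = sum(HISPANIC_SUBGROUPS.values())
    others = TOTAL_POPULATION - hispanic
    weighted = oversampling_factor * hispanic
    return total_sample * weighted / (weighted + others)


def split_by_population_share(group_sample, subgroups):
    """Distribute a group's sample cases in proportion to subgroup population."""
    total = sum(subgroups.values())
    return {name: round(group_sample * count / total)
            for name, count in subgroups.items()}


# Hypothetical survey: 10,000 sample persons, Hispanics oversampled 2 to 1.
hispanic_cases = hispanic_sample_size(total_sample=10_000, oversampling_factor=2.0)
for name, cases in split_by_population_share(hispanic_cases, HISPANIC_SUBGROUPS).items():
    print(f"{name:<28}{cases:>6}")
```

With an oversampling factor of 1.0 the first function reduces to a purely proportional allocation, which corresponds to using the Table 2-1 distribution directly as an approximation of the sample distribution.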

The derived estimates of sample sizes for those databases which do not currently identify all subpopulations also provide an indication of the potential value of the databases as a source of useful information were the appropriate agency to record the race/ethnic detail on the data file.

Although the estimates shown in Tables A-1 through A-3 may differ somewhat from the actual counts and, thus, introduce a degree of approximation in the analyses of these data, we do not believe this will affect in any important way the conclusions to be drawn. The reason for the emphasis on the sample sizes in both this and the Task 3 report is that the sample size is the most important factor in determining the standard errors of a survey's estimates, which, in turn, establish the precision that is achieved. The standard error of a proportion estimated in a survey can be expressed as

$$\sigma = \sqrt{\frac{d\,p(1-p)}{n}}$$

where p is the proportion being estimated, d is the design effect, which depends on the sample design and the specific item being estimated, and n is the sample size. Because n enters the expression only through its square root, the standard errors move rather slowly with changes in the value of n. For example, if the estimates of sample sizes in Tables A-1 to A-3 were off by 10 percent, this would cause an error of only about 5 percent in the estimates of the standard errors. Similarly, if the sample sizes were actually 20 percent higher or lower than those shown in Tables A-1 to A-3, the estimates of standard error would be in error by only about 10 percent (e.g., ±8.8 percent instead of ±8 percent). We do not believe that this would affect a decision on whether a survey can provide useful data on the subpopulations, or the amount of sample increase required to produce reliable data.
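The sensitivity argument can be checked numerically. The following minimal Python sketch evaluates the standard error expression given above and the relative change in the standard error when the true sample size differs from the assumed one. The function names and the inputs (p = 0.30, d = 2, n = 500) are hypothetical and chosen only for illustration; they are not taken from any of the surveys.

```python
import math


def standard_error(p, n, d=1.0):
    """Standard error of an estimated proportion: sqrt(d * p * (1 - p) / n)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n <= 0 or d <= 0:
        raise ValueError("n and d must be positive")
    return math.sqrt(d * p * (1.0 - p) / n)


def relative_se_change(n_error):
    """Fractional change in the standard error when the true sample size is
    (1 + n_error) times the assumed one. p and d cancel out of the ratio."""
    return 1.0 / math.sqrt(1.0 + n_error) - 1.0


# Hypothetical inputs for illustration only (not taken from any survey).
p, d, n_assumed = 0.30, 2.0, 500

for n_error in (-0.20, -0.10, 0.10, 0.20):
    n_true = n_assumed * (1.0 + n_error)
    se_assumed = standard_error(p, n_assumed, d)
    se_true = standard_error(p, n_true, d)
    print(f"n off by {n_error:+.0%}: SE {se_assumed:.4f} -> {se_true:.4f} "
          f"({relative_se_change(n_error):+.1%})")
```

Because the ratio of the two standard errors depends only on the ratio of the sample sizes, the same relative changes apply to every item in a survey, whatever its values of p and d.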

We also note that standard errors will vary greatly among the statistics estimated in each survey because of the impact of the values of d and p in the expression for the standard error. For example, p may be of the order of 0.10 for the proportion of children without health insurance, but 0.30 for the proportion of children below poverty. Consequently, a decision regarding a survey's ability to provide "adequate reliability" will need to focus on either the reliability for a few specific items or on the average among a group of items. Thus, having reasonable approximations, as distinct from exact standard errors, will not appreciably influence any of the conclusions in this report.
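As a rough worked illustration of how much the standard error depends on p, take the two values of p mentioned above with a hypothetical sample size n = 1,000 and a hypothetical design effect d = 2 (neither figure comes from the surveys in the inventory):

$$\sigma_{0.10} = \sqrt{\frac{2 \times 0.10 \times 0.90}{1000}} = \sqrt{0.00018} \approx 0.013, \qquad \sigma_{0.30} = \sqrt{\frac{2 \times 0.30 \times 0.70}{1000}} = \sqrt{0.00042} \approx 0.020$$

Under these assumptions the absolute standard error is larger for the item with p = 0.30, but relative to the proportion itself it is smaller (about 7 percent of p, compared with about 13 percent for p = 0.10). The same sample can therefore be adequate for one item and marginal for another.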

3.  Particular Issues Relating to Content of the Inventory

There are a number of issues that will affect the ability of the surveys to provide statistical data on the minority subgroups, or in some cases to permit data from several surveys to be combined for improved reliability. A detailed discussion of these issues will be included in the Task 3 report, but it seems useful to call attention to them now.

