The level of precision that is considered adequate for a survey should reflect the kinds of analyses to be made and the effect that errors in the statistics are likely to have on practical uses of the information. When differences of 10 or 20 percent will have trivial effects on policy decisions, fairly large sampling errors are tolerable. In other instances, even small errors could have an important adverse effect on uses of the data. The situation is further complicated by the fact that although precision may be satisfactory for most statistics relating to the total population, it may be inadequate for subdomains such as age-sex subgroups, low-income persons, or persons in each region of the U.S. It is frequently found that, no matter how large the sample for a particular survey, there will be some desirable analyses for which the sample is insufficient. Examples include separate studies of babies, teenagers, or the elderly; persons with income below the poverty level; and the rural population.
Consequently, there is no simple or single standard of reliability that is applicable to all studies that may be carried out. Although most large surveys have multiple objectives, in each case the principal uses of the data should be considered along with the consequences of errors in the data. An important part of this consideration is whether certain subdomains need special treatment. The budget that is likely to be available should, of course, also be taken into account. Another relevant factor is the existence of significant nonsampling errors in the data collection system: there is no reason to incur the cost of a very large sample if the main quality problem is poor reporting by the respondents rather than sampling error, and there is no point in establishing unrealistic standards that cannot be achieved. Section 2.2 contains a few examples of how these and other considerations have been instrumental in establishing standards for some of the major U.S. multipurpose sample surveys, which in turn have determined the sample sizes.
Common standards for precision do not exist for U.S. Government surveys. Each survey is viewed as a unique data system designed to meet specific needs. In most cases, the sample size is determined by agreement on the key analytic requirements of the survey weighed against available funding or a realistic budget, rather than by abstract notions of analytic goals. In a few cases, Government agencies have articulated the principal analytic and policy uses expected of the data and the sampling errors that would permit these analyses. Some examples follow:
A sample size of 560 was determined to satisfy these requirements in most classes. Many of the Mexican-American age-sex classes had higher design effects than other classes and needed somewhat larger samples.
To achieve specified levels of precision (described below) for four crucial statistics:
The generic prevalence rates are a useful model for an examination of the feasibility of producing minority subgroup statistics, and we will focus on these specifications later in this report. However, as stated earlier, no single set of specifications is likely to meet all conceivable analytic needs.
The simple formula for the variance of a sample mean given in most elementary statistics textbooks is V(x̄) = (1 − f)S²/n. For a proportion or a prevalence rate, this formula is equivalent to V(p) = (1 − f)p(1 − p)/n.
In these formulas, f is the sampling rate, n is the sample size, S² is the population variance of the characteristic being estimated, and p is the proportion that is estimated. These formulas apply to the simplest type of situation, that is, use of simple random sampling with all members of the population sampled at the same rate.
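These textbook formulas translate directly into code. The sketch below (in Python, with purely illustrative numbers; the sample size of 560 echoes the NHANES example earlier in this section, and the 10 percent prevalence rate is hypothetical) computes the variance and standard error of a mean and of a proportion under simple random sampling:

```python
def srs_variance_of_mean(s2, n, f):
    """Variance of a sample mean under simple random sampling:
    (1 - f) * S^2 / n, where (1 - f) is the finite-population
    correction."""
    return (1 - f) * s2 / n

def srs_variance_of_proportion(p, n, f):
    """Equivalent formula for a proportion or prevalence rate,
    with p * (1 - p) playing the role of the population variance."""
    return (1 - f) * p * (1 - p) / n

# Hypothetical example: a 10 percent prevalence rate estimated from
# a sample of 560, with a negligible sampling rate (f close to 0).
variance = srs_variance_of_proportion(0.10, 560, 0.0)
standard_error = variance ** 0.5
```

When the sampling rate f is small, as in national household surveys, the finite-population correction is close to 1 and is often simply ignored.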
In practice, it is rare for population surveys to use simple random sampling. Where interviewing is done on a face-to-face basis (as distinct from telephone or mail data collection), some form of clustering is almost always used to reduce the cost of interviewer travel. The clustering frequently results from several stages of sample selection, e.g., counties, groups of neighboring households, and members of sample households. Even when telephone or mail is used (e.g., as planned for the dominant data collection methods for the census long form and the ACS), persons within the sample households constitute clusters. The extent to which characteristics of persons within these levels of clustering tend to be correlated influences the size of the sampling variances. In most cases, clustering increases the sampling variances above what would result from a simple random sample of the same size. Variances will also be increased if the sampling rates vary among members of the population. This can come about if some groups are oversampled or undersampled. It can also result from the fairly common practice of selecting a household sample and then choosing one member at random for a more detailed interview; persons in large households then have smaller probabilities of selection than persons in smaller households. To compensate for such features of the sample design, statisticians apply devices that tend to reduce variances, principally stratification and sophisticated weighting methods. However, the features that tend to increase variances usually dominate.
The design effect is a measure of the extent to which the interactions of all such features affect the sampling variances. It is defined as the factor by which the variance of an estimate is changed through departure from simple random sampling. It is generally expressed symbolically by d, so that the variance of a mean becomes V(x̄) = d(1 − f)S²/n. As indicated above, d is mostly, though not necessarily, greater than 1. The value of n/d is frequently referred to as the effective sample size, since substituting n/d for n permits one to use the formulas for simple random sampling.
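The design effect and the effective sample size follow directly from these definitions. In the sketch below, the variances and the sample size of 1,200 are invented values for illustration, not figures from any of the surveys discussed in this report:

```python
def design_effect(actual_variance, srs_variance):
    """d: the factor by which the complex design changes the
    variance relative to simple random sampling of the same size."""
    return actual_variance / srs_variance

def effective_sample_size(n, d):
    """n / d: the simple-random-sample size that would yield the
    same precision as the complex sample of size n."""
    return n / d

# Hypothetical example: clustering and variable weights inflate the
# variance by half (d = 1.5), so a sample of 1,200 yields only as
# much precision as a simple random sample of 800.
d = design_effect(0.00045, 0.00030)
n_eff = effective_sample_size(1200, 1.5)
```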
Design effects differ greatly from one survey to another, since there are important differences among sample designs. They can also vary among different items measured within a survey, and sometimes among specific population groups. A few examples of such variations are described below:
In spite of the diversity in sampling rates, the NHANES sample sizes provide data with fairly good precision for Mexican-Americans in each of the 14 age-sex domains designated by NCHS for separate analysis. On the other hand, the sample size for even the total API population and for American Indians or Alaska Natives is quite low, and it is trivial for individual age-sex groups of these subpopulations.
As the illustrations above indicate, most surveys are subject to a wide array of design effects. If there are a few key statistics in a survey whose importance dominates the analyses and uses of the data (as in the case of unemployment for CPS), then it is useful to concentrate on these statistics in assessing the reliability of the survey estimates. Otherwise, it is sensible to use an average design effect, about midway between the upper and lower levels that are likely to occur. We will follow this latter practice in assessing the ability of the various data sets to meet the needs for data for the different subpopulations of Hispanic and API populations and for American Indian or Alaska Natives.
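One way an average design effect enters such an assessment is in working out required sample sizes. The sketch below inverts the prevalence-rate variance formula to obtain the sample size needed for a target standard error; the 10 percent rate, the 1-percentage-point standard error, and the design effect of 1.5 are all hypothetical values chosen for the example:

```python
def required_sample_size(p, target_se, d):
    """Sample size needed to estimate a prevalence rate p with
    standard error target_se under a design effect of d, inverting
    n = d * p * (1 - p) / se^2 (finite-population correction
    ignored, which is slightly conservative)."""
    return d * p * (1 - p) / target_se ** 2

# Hypothetical targets: a 10 percent rate measured to a standard
# error of 1 percentage point, with an average design effect of
# 1.5, calls for roughly 1,350 completed cases.
n = required_sample_size(0.10, 0.01, 1.5)
```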
Some of the detailed subgroups are not identified on all of the data tapes. Section 1.3 identifies those subgroups that are not fully described on the data records, and indicates whether the omitted subpopulations were not identified in the interview or, if obtained, were not entered into the data tape. In addition, there currently are slight variations among surveys in the way the race/ethnicity questions are worded but, except for the birth registration system, the surveys appear to be reasonably consistent in their classifications. NHANES III probed more intensively than most other surveys to identify persons whose ancestors migrated from Mexico, even though, at this time, they do not consider themselves Mexican-Americans. However, the intensive probing was dropped for the current NHANES. It also is important to note that birth certificates ask for the race/ethnicity of the mother and father, but not of the child. Since information for the father is less likely to be available, published data on births are tabulated by the race/ethnicity of the mother, which can introduce some error into the calculation of rates where the numerators are from the NVS-Natality files but the denominators are drawn from other data systems. This inconsistency, however, is not present for infant mortality rates, where the linked birth/infant death data set is used and data are tabulated by the mother's race.
The differences among surveys are expected to largely disappear when the revised OMB standards for collecting race/ethnic data, which permit respondents with mixed ancestry to choose more than one race, are implemented. Most surveys will be converting to the new standards over the course of the next 4 or 5 years. The new classifications will bring greater consistency among surveys. NHIS has been collecting data on multiple race identification since 1982, and NHANES currently follows the NHIS approach. To reduce possible problems of historical comparability, NCHS, BLS, and the U.S. Census Bureau have carried out research on strategies to bridge the changes created by the shift to the new classification of race/ethnicity. The proposed OMB revisions in the standards for the federal collection of race/ethnicity are shown in Appendix C, Task 2 Report.
Section 4 of this report discusses the possibility of improving the precision for some of the subpopulations by combining data for several years. An immediate question is how many years can be combined without seriously affecting the usefulness of the data.
As with so many of the other issues that have been raised, there is no single time period that would be uniformly acceptable for all surveys or, for that matter, for all items within some of the surveys. We suggest that the decision on the number of years to be combined be based on how slowly or quickly the characteristics measured in a survey change over time. For example, it is unlikely that there will be dramatic changes over the course of a few years in most of the health or nutrition items covered in NHANES, e.g., prevalence of hypertension, high cholesterol levels, or obesity. This, of course, is the reason that NCHS has been comfortable in having previous NHANES data collection extend over a 6-year period. Even though each year of the current NHANES will be based on a random sample, there is no reason why 6 or more years cannot be combined for analyses of data for small population subgroups. Fertility patterns also are likely to change only slowly over time. However, since the NSFG is currently carried out intermittently (about every 5 years), some thought would have to be given to whether combining two cycles of NSFG would excessively stretch the ability to describe the current situation. On the other hand, the limited information on fertility collected annually in CPS could probably be combined over a 3- or 4-year period without any harm, as could the data on educational attainment. Economic statistics, however, can undergo strong fluctuations over a few years, or even less, since they are subject to swings in the economy. It is probably unwise to combine more than 2 or 3 years of data on such items as median income or the poverty rate.
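The precision gain from combining years can be sketched as follows, under the simplifying assumptions that the yearly samples are independent and the characteristic is stable over the period; the yearly estimates and variances shown are invented for illustration:

```python
def pooled_estimate(yearly_estimates):
    """Simple average of estimates from k years."""
    return sum(yearly_estimates) / len(yearly_estimates)

def pooled_variance(yearly_variances):
    """Variance of that average when the yearly samples are
    independent: the sum of the variances divided by k squared,
    so pooling k years of equal-variance data divides the
    variance by k."""
    k = len(yearly_variances)
    return sum(yearly_variances) / k ** 2

# Invented yearly prevalence estimates with equal variances:
# pooling three years cuts the variance to one third of a
# single year's.
est = pooled_estimate([0.11, 0.12, 0.10])
var = pooled_variance([0.0004, 0.0004, 0.0004])
```

Note that the independence assumption fails for surveys with overlapping samples across years, such as rotating panels, in which case the gain is smaller than this sketch suggests.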
A few items are included in more than one survey: health insurance, for example, is covered in NHIS, SIPP, and MEPS. Other items are considered basic covariates for multivariate analysis in many surveys but are important statistics in their own right. Age, sex, and marital status are almost defining characteristics, and they are collected in virtually all questionnaires. Other frequently obtained items are income (personal and/or family income), educational attainment, and labor force status. In Section 5 of this report we discuss the possibility of enhancing the subpopulations' sample sizes by combining data from several surveys.
One would like the question wordings to be reasonably consistent among the surveys that will be combined. This is probably not an issue for such demographic items as age, sex, and marital status, or for educational attainment. However, reporting of income, poverty and, to some extent, labor force status and occupation can be quite sensitive to both the question wording and the amount and type of probing carried out by interviewers. A major consideration for income, and possibly labor force, is how much discrepancy in question wording can be tolerated in order to provide a sufficient sample size for reasonable reliability. It may be possible to calibrate the results of various surveys so that adjusted data are in closer conformity.
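If comparable estimates of the same item from independent surveys can be placed on a common footing, a standard way to pool them is inverse-variance weighting, which gives more weight to the more precise survey. The sketch below is illustrative only; none of the surveys discussed here necessarily combines data this way, and the estimates and variances are invented:

```python
def combine_surveys(estimates, variances):
    """Inverse-variance weighted combination of independent
    estimates of the same quantity. The more precise survey gets
    the larger weight; the combined variance is 1 / sum(1 / v_i),
    which is smaller than any single survey's variance."""
    inv = [1.0 / v for v in variances]
    total = sum(inv)
    combined = sum(w * e for w, e in zip(inv, estimates)) / total
    return combined, 1.0 / total

# Invented example: two surveys of equal precision, so the combined
# estimate is the simple average and the variance is halved.
est, var = combine_surveys([0.10, 0.14], [0.0004, 0.0004])
```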
One additional issue relating to comparability among surveys involves the population covered by the survey: whether the samples represent all 50 states and D.C., and whether each survey includes the entire civilian non-institutional population, excludes some of its components, or includes others, such as the military or institutional population. We do not expect this to be an important concern for most purposes, but analysts who are trying to establish historical series may find that even small inconsistencies can raise fundamental questions about the validity of the data.
The surveys listed in the Task 2 report, which will be analyzed further in the balance of this report, are conducted by federal agencies or carried out under contract for the agencies. Knowledgeable statistical staffs monitor the survey operations, and the quality of the work is generally quite high. There is particular stress on attaining high response rates, and we believe that all of the surveys do about as well as can be expected, given that, with the exception of the vital statistics system, the ACS, and the U.S. Census, survey response is voluntary rather than mandatory.
To say that the response rates are acceptable does not mean there are no potential nonresponse problems. All of the major surveys use poststratification in the final stage of weighting to reduce sampling errors, and to compensate as much as possible for nonresponse and undercoverage. There are almost always separate poststratification cells for blacks, Hispanics, and all other race/ethnic groups and NHANES has such cells for Mexican-Americans. The minority subgroups are almost always combined into categories like "total Hispanics" or "total other races" (which includes American Indians and Alaska Natives). Subdomains such as Puerto-Ricans, Cuban-Americans, Central-Americans, etc., are thus combined into a single class, with identical weights. Similarly, all Asians and Pacific Islanders get identical weights. If, in fact, some of these subgroups have lower response rates than the overall rate for the race/ethnic class, and are not separately adjusted, they will be underrepresented in the statistics. A similar situation exists with undercoverage. For example, if illegal aliens tend to avoid reporting (as seems likely) and if a higher proportion of Mexican-Americans are here illegally than in other Hispanic subpopulations (as is also likely), then the uniform weighting will slightly understate Mexican-Americans and overstate other Hispanic subgroups.
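The mechanics of poststratification described above (ratio-adjusting weights within cells so that weighted totals match external population controls) can be sketched as follows. The cell labels, weights, and control totals are hypothetical; the point of the example is that every unit in a cell receives the same adjustment factor, which is why subgroups pooled into one cell end up with identical weights:

```python
def poststratify(base_weights, cells, controls):
    """Ratio-adjust base weights so that each poststratification
    cell's weighted total matches an external population control.
    Every unit in a cell gets the same factor, so subgroups pooled
    into one cell share identical adjusted weights."""
    # Current weighted total in each cell.
    totals = {}
    for w, c in zip(base_weights, cells):
        totals[c] = totals.get(c, 0.0) + w
    # One adjustment factor per cell: control total / weighted total.
    factors = {c: controls[c] / totals[c] for c in totals}
    return [w * factors[c] for w, c in zip(base_weights, cells)]

# Hypothetical cells and controls: the "hispanic" cell is adjusted
# up by a factor of 1.5 and the "other" cell down by 0.9,
# uniformly within each cell.
new_w = poststratify(
    [100.0, 100.0, 100.0, 100.0],
    ["hispanic", "hispanic", "other", "other"],
    {"hispanic": 300.0, "other": 180.0},
)
```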
Unfortunately, there is not much that can be done to adjust for such occurrences. Agencies already make strenuous efforts to attain high response rates and it is unlikely that further exhortation to improve will be effective. Users of the data, however, should be aware of such limitations in drawing conclusions from the statistics.
All of the major statistical agencies estimate sampling errors for their surveys. Since many of the surveys use complex, multi-stage sample designs, estimation of sampling errors is also fairly complex. Further complications would result from the procedures for enhancing the sample discussed above: averaging over a number of years or combining the data from several surveys. Fortunately, software exists for producing estimates of sampling errors for complex designs, and it could be adapted to cover averaging over years or combinations of surveys. We therefore see no reason to treat the ability to compute estimates of sampling errors as an impediment to the use of these procedures.
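Replication methods are one family of techniques such software commonly uses. As a minimal illustration of the idea, and not the production procedure of any agency, the delete-one jackknife below estimates the variance of a simple mean; survey packages generalize the same principle to complex designs by dropping whole clusters rather than individual observations:

```python
def jackknife_variance_of_mean(values):
    """Delete-one jackknife variance of the sample mean: recompute
    the mean leaving out each observation in turn, then scale the
    squared spread of the replicate means by (n - 1) / n. For the
    simple mean this reproduces the textbook s^2 / n."""
    n = len(values)
    full = sum(values) / n
    # Replicate means, each omitting one observation.
    reps = [(sum(values) - v) / (n - 1) for v in values]
    return (n - 1) / n * sum((r - full) ** 2 for r in reps)

v = jackknife_variance_of_mean([1.0, 2.0, 3.0, 4.0])
```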
Human Services Policy (HSP)
Assistant Secretary for Planning and Evaluation (ASPE)
U.S. Department of Health and Human Services (HHS)
Last updated 9/14/00