Extending the Utility of Federal Data Bases. Combining Results from Several Surveys


Assessment of Major Federal Data Sets for Analyses of Hispanic and Asian or Pacific Islander Subgroups and Native Americans


As mentioned earlier, a few items are included in several different data sets. For example, health insurance is covered in both NHIS and SIPP, and key demographic variables such as age, sex, and household relationship are included in almost all surveys, as are certain social and economic items, including income and educational attainment. Estimates of these items for minority subgroups can be improved by combining the results from the surveys that request similar information. The sample sizes for most subgroups are already fairly high in NHIS and SIPP, and estimates of health insurance from the combined data sets would reduce the sampling errors even further and thus permit analysis of finer breakdowns within subgroups, such as by specified ages or geographic divisions. It should be noted, however, that health insurance is measured somewhat differently in the surveys, and it is not clear whether the gain in sample size from combining them compensates for the problems arising from the differences in question wording.
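One common way to combine independent estimates of the same item is an inverse-variance weighted average. The sketch below is illustrative only and is not a method prescribed in this report. It assumes the surveys measure the same concept, that their samples are independent, and that the standard errors already reflect each survey's design. The function name and the figures are hypothetical.

```python
import math

def combine_estimates(estimates, std_errors):
    """Inverse-variance weighted average of independent survey estimates.

    Returns the combined estimate and its standard error. Assumes the
    estimates are independent and measure the same quantity.
    """
    # Each survey is weighted by the reciprocal of its sampling variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    combined_se = math.sqrt(1.0 / total_weight)
    return combined, combined_se

# Hypothetical figures: percent uninsured in one subgroup from two surveys,
# each with a standard error of 2.0 percentage points.
est, se = combine_estimates([30.0, 34.0], [2.0, 2.0])
print(est, se)
```

With two equally precise estimates, the combined standard error falls by a factor of the square root of 2. The calculation addresses sampling error only. It does nothing about the differences in how the surveys measure the item, and a gap between the two input estimates like the one in this example could reflect exactly such a difference.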

Even greater reductions in sampling errors are possible for the basic demographic and related characteristics that appear in almost all surveys, since they are considered essential covariate items. However, we doubt that combining surveys is necessary for these items. They will be covered in the ACS, which the U.S. Census Bureau expects to initiate in the next few years. First, the ACS sample will dwarf the samples of the other government surveys, so it seems sensible to base the analysis of such items as age distributions, income, education, and geography on the ACS; including the other surveys would hardly reduce the sampling errors. Second, the ACS data would not be subject to procedural differences among surveys, e.g., slightly different question wordings or variation in response rates, as would be the case with a combined data set.

The ability to improve statistics by combining data from a number of surveys is essentially restricted to a handful of items. Most information collected in NHIS is not repeated in SIPP or in the other surveys, and the same is true of other pairs of data sets. Analysis of the broad array of data items in a survey therefore cannot be improved by combining surveys, in contrast to the improvements possible by averaging over time.
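The earlier observation that adding smaller surveys to a much larger one hardly reduces sampling error can be checked with simple arithmetic. The sketch below assumes simple random sampling with equal unit variance, so that the standard error is proportional to one over the square root of the sample size. Real surveys have complex designs, and the sample sizes used here are hypothetical placeholders, not actual figures for the ACS or any other survey.

```python
import math

def se_reduction_from_pooling(n_large, n_small):
    """Fractional reduction in standard error from pooling two samples.

    Assumes simple random sampling and equal unit variance, so the
    standard error scales as 1 / sqrt(n).
    """
    return 1.0 - math.sqrt(n_large / (n_large + n_small))

# Hypothetical sizes: a very large survey pooled with a much smaller one.
reduction = se_reduction_from_pooling(3_000_000, 100_000)

# For comparison: pooling two surveys of equal size.
equal_reduction = se_reduction_from_pooling(100_000, 100_000)
print(reduction, equal_reduction)
```

Under these assumptions the pooled standard error is less than 2 percent smaller than that of the large survey alone, whereas pooling two surveys of equal size cuts the standard error by roughly 29 percent.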

The MEPS and NSFG samples are subsets of persons in NHIS, and there are advantages to combining NHIS data with information from these two surveys, e.g., cross-classifying MEPS or NSFG data with selected NHIS variables, or using NHIS as a source of controls for poststratification. We do not discuss these uses of combinations of surveys in this report because they do not contribute substantially to the ability of the surveys to provide reasonably reliable data for subpopulations.