Extending the Utility of Federal Data Bases. Combining Data from Several Surveys


A few items are included in more than one survey; health insurance, for example, is covered in NHIS, SIPP, and MEPS. Other items are treated as basic covariates for multivariate analysis in many surveys but are important statistics in their own right. Age, sex, and marital status are almost defining characteristics and are collected in virtually all questionnaires. Other frequently obtained items are income (personal and/or family), educational attainment, and labor force status. In Section 5 of this report we discuss the possibility of enhancing subpopulation sample sizes by combining data from several surveys.
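
As a rough illustration of why combining surveys can improve reliability for a small subpopulation, the sketch below pools independent estimates of the same quantity using inverse-variance weights. This is a standard textbook approach offered here as an assumption, not necessarily the method discussed in Section 5. It presumes that the surveys are independent and measure the same concept; the function name and all figures are hypothetical.

```python
from math import sqrt

def combine_estimates(estimates):
    """Pool independent estimates with inverse-variance weights.

    estimates: list of (point_estimate, standard_error) pairs,
    one pair per survey. Assumes the surveys are independent and
    measure the same concept for the same population.
    """
    # Each survey is weighted by the reciprocal of its sampling variance.
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    combined = sum(w * est for w, (est, _) in zip(weights, estimates)) / total_weight
    # The variance of the pooled estimate is the reciprocal of the total weight.
    combined_se = sqrt(1.0 / total_weight)
    return combined, combined_se

# Hypothetical estimates of one subpopulation rate from three surveys
# (illustrative numbers only, not actual survey results).
hypothetical = [(0.20, 0.03), (0.22, 0.04), (0.18, 0.05)]
pooled, pooled_se = combine_estimates(hypothetical)
print(f"pooled estimate = {pooled:.4f}, standard error = {pooled_se:.4f}")
```

Under these assumptions the pooled standard error is always smaller than the smallest single-survey standard error, which is the sense in which combining surveys enlarges the effective sample. If question wording or coverage differs among the surveys, the pooled figure may be biased even though its sampling error is smaller.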

One would like the question wordings to be reasonably consistent among the surveys that will be combined. This is probably not an issue for such demographic items as age, sex, and marital status, or for educational attainment. However, reporting of income, poverty and, to some extent, labor force status and occupation can be quite sensitive to both the question wording and the amount and type of probing carried out by interviewers. A major consideration for income, and possibly labor force, is how much discrepancy in question wording can be tolerated in order to provide a sufficient sample size for reasonable reliability. It may be possible to calibrate the results of various surveys so that adjusted data are in closer conformity.
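
One simple way to picture such a calibration is a ratio adjustment, in which estimates from one survey are scaled so that an overall figure matches a benchmark survey. The sketch below is a minimal illustration under that assumption; the report does not specify a calibration method, and the function name, subgroup labels, and figures are all hypothetical.

```python
def ratio_calibrate(subgroup_estimates, survey_overall, benchmark_overall):
    """Scale one survey's subgroup estimates toward a benchmark survey.

    subgroup_estimates: dict mapping subgroup label -> estimate (e.g. mean income)
    survey_overall:     the same survey's overall estimate
    benchmark_overall:  the benchmark survey's overall estimate
    Assumes the wording effect is roughly proportional across subgroups.
    """
    if survey_overall == 0:
        raise ValueError("survey_overall must be nonzero")
    factor = benchmark_overall / survey_overall
    adjusted = {group: value * factor for group, value in subgroup_estimates.items()}
    return adjusted, factor

# Hypothetical mean incomes by subgroup (illustrative numbers only).
survey_b = {"group_1": 40000.0, "group_2": 60000.0}
adjusted, factor = ratio_calibrate(survey_b, survey_overall=50000.0,
                                   benchmark_overall=55000.0)
print(factor, adjusted)
```

A single proportional factor is the crudest possible choice. If wording and probing affect some subgroups more than others, separate factors by subgroup or a model-based adjustment would be needed, and the adjustment itself adds uncertainty that should be reflected in the reported reliability.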

One additional issue relating to comparability among surveys is the population each survey covers: whether the samples represent all 50 states and D.C., and whether each survey includes the entire civilian non-institutional population, excludes some components of it, or includes others, such as the military or institutional population. We do not expect this to be an important concern for most purposes, but analysts who are trying to establish historical series may find that even small inconsistencies can raise fundamental questions about the validity of the data.