Extending the Utility of Federal Data Bases. Effective Sample Size for Combined Years


The effective sample sizes for combined years are shown in Tables 4-1 to 4-3. Except for NHANES, the effective sample size for 2 years is a little less than twice the sample for a single year; similarly, the 3- and 5-year effective samples are not quite 3 or 5 times the annual sample sizes. All of the surveys use clustered sample designs, and the samples for a sequence of several years are mostly in the same clusters or in neighboring ones. This lack of independence among the samples for different years tends to reduce the effective sample size. We have estimated that the reduction in effective sample size over a 2-year interval is about 17 percent; the reduction for a 3-year period is 25 percent; and the reduction for 5 years is about 35 percent. These reductions come from estimated correlations between years in the sample: correlations between adjacent years are expected to average .20, those between samples 2 years apart .10, 3 years apart .07, and 4 years apart .05. The current NHANES samples are independent across years, and, therefore, there is no reduction in effective sample size.
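The stated reductions can be approximately reproduced from the stated correlations. The sketch below is an illustration only, not necessarily the calculation used for the tables. It assumes that the combined estimate is a simple average of equal-sized annual samples with equal variances, so that the variance of a k-year average is inflated by the factor 1 + (2/k) × Σ (k − lag) × r(lag). The function and variable names are hypothetical.

```python
# Illustrative sketch: loss of effective sample size when k correlated
# annual samples are averaged. ASSUMPTION (not stated in the text): equal
# annual sample sizes and variances, and a simple unweighted average.

# Average correlations by number of years apart, as given in the text.
LAG_CORRELATIONS = [0.20, 0.10, 0.07, 0.05]


def variance_inflation(lag_correlations, years):
    """Variance of a `years`-year average relative to independent samples."""
    total = 0.0
    for lag in range(1, years):
        # There are (years - lag) pairs of samples that are `lag` years apart.
        total += (years - lag) * lag_correlations[lag - 1]
    return 1.0 + 2.0 * total / years


def effective_reduction(lag_correlations, years):
    """Fractional loss in effective sample size from combining `years` years."""
    return 1.0 - 1.0 / variance_inflation(lag_correlations, years)


for k in (2, 3, 5):
    pct = 100 * effective_reduction(LAG_CORRELATIONS, k)
    print(f"{k} years: reduction of about {pct:.0f} percent")
```

Under these assumptions the computed reductions are roughly 17, 25, and 34 percent for 2, 3, and 5 years, close to the figures quoted above. Setting all correlations to zero, as for the independent NHANES samples, gives no reduction.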

The effective sample sizes in Tables 4-1 to 4-3 are approximations based on even more assumptions and averages than the numbers in Tables 3-6 to 3-8. The sample sizes in each year are subject to sampling error and to fluctuations in response rates. This is especially true for the minority subgroups with very small samples; the samples for Cubans or Hawaiians could differ in neighboring years by 10 or 20 percent from the year reflected in our tables. Also, the year-to-year correlations, which result from the similarities in characteristics among neighboring households, are average values expected over a set of items, much like average design effects. Nevertheless, the numbers shown in Tables 4-1 to 4-3 indicate the order of magnitude of effective sample sizes and reveal whether useful analyses are possible from each of the data sets.

One feature of the monthly CPS sample should be noted. The monthly CPS includes two kinds of data sets: (1) labor force information and critical demographic items (e.g., age, sex, and household relationship) obtained each month; and (2) supplemental items covered in months other than March. The supplemental items (based on the monthly CPS sample sizes) that are likely to be of greatest interest are number of children ever born, related fertility information, and school enrollment. Voting registration and behavior in the most recent election are obtained every second year, but it is doubtful that combining pairs of years would be meaningful: voting patterns in presidential and non-presidential years are very different, and such combinations would probably not be analytically revealing. The entries for CPS in Tables 4-1 to 4-3 are restricted to the supplemental items. The annual sample sizes for labor force information, of course, are much larger than the numbers shown, since they are composed of 12 monthly samples (see Section 4.5). The supplemental items included in the March interview are based on the same sample as the other monthly supplements, except for Hispanics, for whom the sample is doubled.

Since the MEPS and NSFG samples are taken from NHIS respondents, it would be possible to supplement their samples with additional names and addresses from the NHIS. These names and addresses would be a few years old, and thus it may be more convenient to simply combine multiple years of MEPS respondents. The exact timing of these surveys, and the associated costs, would have to be examined before a decision is made on which approach would be preferable.