# National Invitational Conference on Long-Term Care Data Bases: Conference Package. IV. SAMPLE DESIGN AND ITS EFFECT ON THE ANALYSIS OF THE SURVEY

An important factor in the analysis of any survey, but one that frequently generates confusion, is the appropriate use of sample weights in the analysis. This is because sample weights play different roles in different stages of the analysis and because there are several different methodologies for dealing with the effects of sample design in analysis. The issues become more complex in the current study because of its longitudinal nature.

The first set of issues involves the role of weights in various stages of analysis. One stage of analysis has to do with the testing of statistical hypotheses using the survey data. The basic problem is that the samples are not simple random samples but probability samples, i.e., different subpopulations are drawn with pre-specified probabilities to increase the precision of estimates for certain rare populations. Furthermore, in some sample designs, the samples are drawn from spatially designated clusters to reduce costs. Since persons in each cluster will tend to share certain socioeconomic and residential characteristics, their responses will tend to be correlated, i.e., each person cannot be viewed as providing an independent response.

In the NLTCS these problems are minimized because the sample design is relatively simple. In 1982 the population was stratified only on age, sex, and race. In 1984 there was the additional complication that only 47.4% of the non-disabled community-dwelling persons were screened, adding an additional weighting factor.

The problem in analysis is that stratification and sample clustering (clustering has little effect in this design) affect the estimate of error variance, which is used in our test statistics to determine whether a particular hypothesis should be accepted or rejected. The analytic problem is to determine how the sample design affects the variance of our parameter estimates. There are two analytic approaches to this problem. The first is to use some model of randomization to inflate the error variance, providing a conservative adjustment to our test statistics. Several computer programs exist that do this for continuous variables in simple regression models. However, given the simple nature of the study design, certain simple calculations can be used to adjust variance estimates. This was illustrated by the Census Bureau for the 1982 cross-sectional sample. A table of adjustment factors is provided in Table 4 below.

TABLE 4. "a" and "b" Parameters and "f" Factors for Computing Approximate Standard Errors of Estimated Numbers and Percentages of Persons

| Characteristic | a | b | "f" factor |
|---|---|---|---|
| Black persons or persons receiving Medicaid | -.00008227 | 2094 | 1.4 |
| All other | -.00004027 | 1025 | 1.0 |

In Table 4 are the parameters for two regression equations. Both were obtained by regressing the variance of an estimate on the estimate itself for each of two groups, i.e., "blacks or persons receiving Medicaid" and "all others." To apply them, take the number of persons having a particular characteristic in 1982, multiply the square of that number by parameter a and the number itself by parameter b, and add the two products. The square root of this sum, multiplied by the appropriate "f" factor, is the standard error of the estimate. To illustrate, in the 1982 survey there were estimated to be 1,190,764 aged persons requiring personal help in bathing. Symbolically, the formula is

Standard error of x = √(ax² + bx) × f

If, as for the example, f = 1.0, then the calculation is

Standard error of x = √((-.00004027)(1,190,764)² + (1025)(1,190,764)) × (1.0)

or, 34,109.
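This calculation is easy to script. A minimal sketch in Python (the function name is our own), using the Table 4 parameters for the "all other" group:

```python
import math

def standard_error(x, a, b, f):
    """Approximate standard error of an estimated number x of persons,
    using the generalized-variance parameters a, b, and f from Table 4."""
    return math.sqrt(a * x**2 + b * x) * f

# Aged persons requiring personal help in bathing, 1982 estimate
se = standard_error(1_190_764, a=-0.00004027, b=1025, f=1.0)
print(round(se))  # 34109
```

The same function with a = -.00008227, b = 2094, and f = 1.4 would give standard errors for estimates of black persons or persons receiving Medicaid.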

Thus, the one-standard-deviation (68%) confidence interval is ±34,109, or 1,156,655 to 1,224,873. The 95% confidence interval would be ±2 × (34,109). For the confidence interval of a difference between two estimates one uses

Standard error of difference = √(sx² + sy² - 2r·sx·sy)

where sx and sy are the standard errors of the two estimates to be compared and r is the correlation coefficient (which can often be assumed to be zero). Alternatively, for the 1982 tables, the standard errors of both numbers and percentages were calculated. These are presented in Table 5.

TABLE 5

A. Standard Errors of Estimated Percentages of Persons (columns give the estimated percentage)

| Base of estimated percentage (thousands) | 2 or 98 | 5 or 95 | 10 or 90 | 25 or 75 | 50 |
|---|---|---|---|---|---|
| 25 | 2.8 | 4.4 | 6.1 | 8.8 | 10.1 |
| 50 | 2.0 | 3.1 | 4.3 | 6.2 | 7.2 |
| 100 | 1.4 | 2.2 | 3.0 | 4.4 | 5.1 |
| 250 | 0.9 | 1.4 | 1.9 | 2.8 | 3.2 |
| 500 | 0.6 | 1.0 | 1.4 | 2.0 | 2.3 |
| 750 | 0.5 | 0.8 | 1.1 | 1.6 | 1.8 |
| 1000 | 0.4 | 0.7 | 1.0 | 1.4 | 1.6 |
| 2000 | 0.3 | 0.5 | 0.7 | 1.0 | 1.1 |
| 3000 | 0.3 | 1.4 | 0.6 | 0.8 | 0.9 |
| 4000 | 0.2 | 0.3 | 0.5 | 0.7 | 0.8 |
| 5000 | 0.2 | 0.3 | 0.4 | 0.6 | 0.7 |

B. Standard Errors of Estimated Numbers (in thousands)

| Size of Estimate | Standard Error | Size of Estimate | Standard Error |
|---|---|---|---|
| 25 | 5.1 | 1000 | 31.4 |
| 50 | 7.2 | 2000 | 43.5 |
| 100 | 10.1 | 3000 | 52.1 |
| 250 | 15.9 | 4000 | 58.8 |
| 500 | 22.4 | 5000 | 64.2 |
| 750 | 27.3 | - | - |
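The standard-error-of-difference formula is equally simple to apply. A short sketch, assuming the two estimates are uncorrelated (r = 0) and using purely illustrative standard errors:

```python
import math

def se_difference(sx, sy, r=0.0):
    """Standard error of the difference between two estimates with
    standard errors sx and sy and correlation r (often taken as zero)."""
    return math.sqrt(sx**2 + sy**2 - 2 * r * sx * sy)

# Hypothetical standard errors for two estimates (illustration only)
print(round(se_difference(34_109, 22_400)))  # 40807
```

With r = 0 the formula reduces to the familiar root-sum-of-squares of the two standard errors.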

The numbers in these tables need to be multiplied by the appropriate "f" values in Table 4. We do not yet have similar tables for the 1984 survey. However, knowing the sizes of the various sub-samples, we can present coefficients of variation for sample sizes corresponding to the 1982 and planned 1984 subsamples (i.e., 10,250 is the number of non-disabled persons planned to be screened in 1984; 6,089 is approximately the number of persons interviewed in 1982; 1,712 is the number of persons institutionalized before April 1, 1982; and 856 and 428 are half and a quarter of that number). These numbers are presented in Table 6.

TABLE 6. CV's for Various Rates and Sample Sizes

| Rate | n = 10,250 | n = 6,089 | n = 1,712 | n = 856 | n = 428 |
|---|---|---|---|---|---|
| 1% | .127 | .165 | .311 | .439 | .622 |
| 5% | .056 | .072 | .136 | .193 | .272 |
| 10% | .038 | .050 | .094 | .133 | .187 |
| 25% | .022 | .029 | .054 | .077 | .108 |
| 50% | .013 | .017 | .031 | .044 | .062 |
| Rate giving a 10% CV | 1.6% | 2.7% | 8.9% | 16.3% | 28.1% |
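The tabled CVs are consistent with the binomial coefficient of variation √((1−p)/(pn)) inflated by a roughly constant factor of about 1.29. That factor is our own inference from the table (presumably a design-effect adjustment), not something stated in the text, so the sketch below should be read as a reconstruction, not the official formula:

```python
import math

# Assumed constant inflation factor, inferred by comparing Table 6
# values to the simple binomial CV; this is our assumption.
DESIGN_FACTOR = 1.29

def cv(rate, n, design_factor=DESIGN_FACTOR):
    """Approximate coefficient of variation for an estimated rate
    from a sample of size n."""
    return design_factor * math.sqrt((1 - rate) / (rate * n))

print(round(cv(0.01, 10_250), 3))  # 0.127, matching Table 6
print(round(cv(0.50, 428), 3))     # 0.062, matching Table 6
```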

An alternative approach to adjusting error variance estimates for sample design effects is based on the realization that many of these design effects may be of substantive interest. One can therefore explicitly model the design factors as part of the analysis so that design effects are explicitly represented. Such an approach has the advantage of helping us better understand the mechanisms generating the phenomenon, but the disadvantage of requiring that the correct model be developed. Though it may seem tedious and difficult to search for the "correct" model rather than using a "general" model of randomization, it should be realized that only by producing the correct model can one really generalize the parameter estimates beyond the particular sample, i.e., either to the general population or in forecasts of future needs. Thus, in many more situations than is normally realized, the search for a model-based adjustment for complex sample design effects is a necessity.
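The model-based approach can be made concrete with a toy example. The sketch below (entirely hypothetical data; "stratum" stands in for a design variable such as race or age group) contrasts a pooled estimate that ignores the design with a model that carries the design factor as an explicit parameter:

```python
# Hypothetical observations: (stratum, outcome score)
samples = [
    ("A", 2.0), ("A", 3.0), ("A", 2.5),
    ("B", 5.0), ("B", 6.0), ("B", 5.5),
]

# Pooled model: one mean for everyone, ignoring the design factor.
pooled_mean = sum(y for _, y in samples) / len(samples)

# Design-aware model: a separate mean per stratum, so the design
# effect becomes an explicit, interpretable parameter.
strata = {}
for stratum, y in samples:
    strata.setdefault(stratum, []).append(y)
stratum_means = {s: sum(v) / len(v) for s, v in strata.items()}

print(pooled_mean)     # 4.0
print(stratum_means)   # {'A': 2.5, 'B': 5.5}
```

The stratum-specific parameters are what allow estimates to be generalized to a population with a different stratum composition than the sample.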

A second analytic stage where sample weights are used is in "post" weighting, i.e., where one wishes to recombine parameter estimates for sub-groups to produce the parameter estimates for the total population that was sampled. This usually involves re-weighting the data to reflect the inverse of the probability of selection. This is actually a purely algebraic procedure that is independent of the methods to calculate the effects of sample weights on statistical inferences.
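The algebra of post weighting can be shown in a few lines. A minimal sketch with hypothetical records and selection probabilities (the 0/1 values mark whether a person has the characteristic of interest):

```python
# Each sampled record is weighted by the inverse of its selection
# probability; weighted values are then summed to recover estimates
# for the full sampled population. All numbers are hypothetical.
records = [
    # (has_characteristic, probability of selection)
    (1, 0.50),
    (1, 0.50),
    (0, 0.25),  # e.g., a member of an oversampled stratum
    (1, 0.25),
]

weighted_total = sum(value / p for value, p in records)   # estimated count
population_size = sum(1 / p for _, p in records)          # estimated N
weighted_rate = weighted_total / population_size

print(weighted_total, population_size, round(weighted_rate, 2))
```

Note that, as the text says, this re-weighting is purely algebraic: it recombines sub-group estimates into population totals and is separate from the variance-adjustment question discussed above.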