Methodological Issues in the Evaluation of the National Long Term Care Demonstration. F. Potential Problems with Regression Analysis


Under certain statistical assumptions, the regression procedure described in Chapter III will provide unbiased estimates of channeling impacts. The assumption on which unbiasedness depends is that the disturbance term representing the unobserved factors affecting outcomes be uncorrelated with the screen/baseline control variables and treatment status. This condition is not definitely verifiable, but the fact that sample members were randomly assigned to treatment and control groups makes it unlikely that the disturbance term is correlated with treatment status; hence, estimates of channeling impacts obtained by regression are expected to be unbiased.

Unbiasedness is not the only desirable property of the estimates, however. When outcome variables are not normally distributed, regression estimates lose some of their other desirable properties and may exhibit other characteristics that are undesirable. Two types of channeling outcome variables that had non-normal distributions were those that were binary or truncated at zero, and those that were skewed (i.e., that had extremely large values for a small number of observations). Analyses were conducted to determine whether the regression estimates of impacts on these two types of outcomes were distorted or less reliable in some way than alternative estimates.

1. The Validity of Regression Estimates of Channeling Impacts for Binary and Truncated Dependent Variables

Estimates that are unbiased are known to be accurate on average; however, we also want impact estimates that in any particular instance are unlikely to deviate greatly from true impacts. The smaller the variance of the estimates, the narrower the confidence intervals around the estimates and the lower the probability of failing to detect important channeling impacts. However, the requirement for regression estimates to have minimum variance--homoscedasticity of the disturbance terms--will not be met for many of the dependent variables examined in the channeling evaluation because they are binary (e.g., whether admitted to a nursing home) or bounded at zero (e.g., number of days spent in the hospital). Furthermore, if the disturbance term is not homoscedastic, the test statistics calculated by the regression program will not be strictly correct. Finally, the predicted value for some observations may be less than zero when regression is used for binary or bounded dependent variables, which is obviously inappropriate. (Predicted values may also be greater than one, which is equally inappropriate for binary variables.)
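The reason homoscedasticity fails for a binary outcome can be shown in one line. The following is a standard textbook derivation, not taken from the channeling reports; the symbol P (the probability that Y = 1 given the explanatory variables) is introduced here only for illustration.

```latex
% Linear probability model: Y = P + e, where P = \Pr(Y = 1 \mid X).
% The disturbance e equals (1 - P) with probability P and (-P) with probability (1 - P), so
\operatorname{E}(e \mid X) = P(1 - P) + (1 - P)(-P) = 0,
\qquad
\operatorname{Var}(e \mid X) = P(1 - P)^2 + (1 - P)P^2 = P(1 - P).
```

Because P varies with the explanatory variables, the variance of the disturbance varies across observations, which is exactly the violation of homoscedasticity described above.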

For cases such as these, econometric procedures have been developed to provide estimates with desirable properties (under certain assumptions). Probit and logit models are the estimation procedures most widely used for binary dependent variables, and Tobit analysis is used by economists for bounded variables. (See Maddala, 1983, for a discussion of these procedures, their statistical properties, and the assumptions on which they are based.) In practice, however, these more complex and expensive estimation procedures typically provide estimates of the effects of explanatory variables on dependent variables that closely resemble in size and significance the estimated effects obtained from least squares regression. This result has been demonstrated in several previous applied studies (Corson et al., 1985; Grossman et al., 1986; Hollister et al., 1985; and others) as well as in the recent econometric literature (Greene, 1981, 1983). Furthermore, all of the statistical properties of the probit and Tobit estimators, including unbiasedness, depend on the assumption that the disturbance term is normally distributed, a condition not required by regression.

The much greater ease with which statistical tests can be performed with least squares regression and the much lower computational cost compared to probit, logit, and Tobit (which require iterative maximum likelihood estimation) led us to strongly prefer least squares as an estimation strategy. However, to ensure that computational ease and cost savings were not achieved at the cost of seriously distorted impact estimates or test statistics, we compared estimates of channeling impacts obtained from regression to estimates obtained from the more complex procedures, using key outcome variables that were binary or truncated at zero.32

Comparison of the probit model estimates to least squares estimates for binary dependent variables.33 The probit model is based on the assumptions that individuals will take a given action (e.g., enter a nursing home) when a certain unobserved threshold is reached, that this threshold is determined by observed and unobserved factors, and that the threshold differs across individuals. Consider, for example, the decision to enter a nursing home. The probit model for this outcome is written as:

$$
Y^* = a_0 + a_B T_B + a_F T_F + a_S S + a_X X - e
$$

$$
Y = \begin{cases} 1 & \text{if } Y^* > 0 \\ 0 & \text{if } Y^* \le 0 \end{cases}
$$

where Y* is the unobserved indicator of the propensity to enter a nursing home, which depends on the set of variables specified as explanatory variables in the standard regression equation given in Chapter III. The disturbance term e is the unobserved individual-specific threshold, for example, the individual's unwillingness to enter nursing homes.34 Sample members whose unmet need for services is so great that it outweighs their distaste for nursing homes are assumed to enter such institutions (given the availability of beds). The observed binary dependent variable (Y) is equal to 1 for those who enter nursing homes and 0 for those who do not. The parameters of this probit model (the a coefficients) are estimated by maximum likelihood, i.e., by choosing the values that maximize the product of predicted probabilities of entering a nursing home (for actual entrants) or not entering (for nonentrants). Predicted probabilities from this model will always be between zero and one, and if the assumed model is correct, the resulting estimates have the minimum variance possible. The estimated impacts of channeling are obtained by computing the predicted probability of entering for a treatment group member, with all of the other characteristics X set at the sample mean, and subtracting the predicted probability for controls computed at the same values of X.

TABLE IV.2. Impact Estimates from Least Squares Regression and from Probit for Selected Binary Outcome Measures
(In percentage points; t-statistics in parentheses)

| Outcome | Basic Model: Regression | Basic Model: Probit (a) | Financial Control: Regression | Financial Control: Probit (a) | Sample Size |
|---|---|---|---|---|---|
| Whether Received Any Formal Care, 6 months | 6.96** (3.49) | 7.35** (3.49) | 16.31** (8.09) | 17.23** (8.12) | 4,974 |
| Whether Had Any Visiting Informal Caregiver, 6 months | -2.33 (-1.22) | -2.34 (-1.18) | -2.57 (-1.33) | -2.77 (-1.33) | 4,899 |
| Whether Received Any Informal Care, 6 months | -2.97 (-1.50) | -3.12 (-1.44) | -2.64 (-1.32) | -2.92 (-1.38) | 4,899 |
| Whether Received Comprehensive Case Management, months 1-6 | 51.17** (26.33) | 52.67** (26.44) | 56.34** (28.93) | 58.35** (29.36) | 3,955 |
| Whether Admitted to Hospital, months 1-6 | -2.80 (-1.44) | -2.93 (-1.47) | 2.04 (1.04) | 2.12 (1.07) | 5,554 |
| Whether Admitted to Hospital, months 7-12 | -0.36 (-0.20) | -0.43 (-0.23) | 0.37 (0.20) | 0.48 (0.26) | 5,554 |
| Whether Admitted to Nursing Home, months 1-6 | -0.52 (-0.37) | -0.20 (-0.15) | -0.37 (-0.27) | -0.16 (-0.12) | 4,593 |
| Whether Admitted to Nursing Home, months 7-12 | -2.23 (-1.88) | -2.22 (-1.93) | 0.29 (0.25) | 0.40 (0.36) | 4,752 |

NOTE: Regression estimates and sample sizes do not in all cases correspond exactly with those presented in final channeling reports, because some changes may have taken place between the time that this analysis was conducted and the final analyses were completed.

  a. Estimates of channeling impacts were obtained from the probit coefficients by computing the predicted probability of the dependent variable for treatments and for controls (with all of the explanatory variables set at their overall sample means) and subtracting. Thus, impact = F(Xb + a) - F(Xb), where F is the cumulative normal distribution function, X is the mean of the explanatory variables for treatments and controls combined, b is the vector of estimated probit coefficients on the explanatory variables, and "a" is the estimated probit coefficient on the treatment status indicator. The standard error of this difference was then calculated using the usual formula for approximating the variance of a nonlinear combination of estimators (Kmenta, 1971, p. 444). The t-statistic is simply the ratio of the estimated impact to the estimated standard error of the impact.

** Significantly different from zero at the .01 level (2-tailed test).

The least squares and probit estimates of channeling impacts on a set of key binary outcome variables are compared in Table IV.2. The impact estimates and t-statistics were very similar for all six of the variables examined, for both models. For no outcome was there a change in the statistical significance when probit was used. Even estimates that were statistically insignificant exhibited only small changes in magnitude.
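The impact calculation described in the note to Table IV.2, impact = F(Xb + a) - F(Xb), can be sketched in a few lines. This is a minimal illustration only: the function names and the numeric values of the index Xb and the treatment coefficient a are hypothetical and are not estimates from the channeling evaluation.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function F(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_impact(index_at_means: float, treatment_coef: float) -> float:
    """Difference in predicted probabilities, F(Xb + a) - F(Xb).

    index_at_means is Xb, the probit index evaluated at the sample means of
    the explanatory variables; treatment_coef is a, the coefficient on the
    treatment status indicator.
    """
    treated = normal_cdf(index_at_means + treatment_coef)
    control = normal_cdf(index_at_means)
    return treated - control


if __name__ == "__main__":
    # Hypothetical values chosen only to show the mechanics.
    xb = -1.2
    a = -0.05
    impact = probit_impact(xb, a)
    print(f"Estimated impact: {100 * impact:.2f} percentage points")
```

Because F is nonlinear, the same coefficient a implies a different percentage-point impact at different values of Xb, which is why the note specifies evaluation at the overall sample means.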

Comparison of Tobit estimates to least squares regression estimates. When the dependent variable is truncated at zero but not binary, such as nursing home expenditures or days, regression estimates lose some of their desirable properties. The Tobit procedure, which is closely related to the probit procedure, was designed to overcome these weaknesses. A Tobit model of the number of days spent in nursing homes, for example, would be written as:

$$
Y^* = a_0 + a_B T_B + a_F T_F + a_S S + a_X X - e
$$

$$
Y = \begin{cases} Y^* & \text{if } Y^* > 0 \\ 0 & \text{if } Y^* \le 0 \end{cases}
$$

where observed nursing home days (Y) is equal to the expression given for Y* for individuals whose need for nursing home care outweighs their unobserved unwillingness to enter nursing homes (e), and equal to zero for others. Again, maximum likelihood methods are used to estimate the coefficients and the standard error of e. The effects of channeling are estimated by computing the expected value of the outcome Y for treatments and for controls, both at the point of means of the other explanatory variables, and taking the difference. (See Moffitt and McDonald, 1980, for the correct expression for obtaining predicted outcomes from Tobit models.)

The regression and Tobit estimates of channeling impacts on a set of key outcome variables that are bounded at zero are contained in Table IV.3. For most of the 24 comparisons, the differences between the two alternative estimates were quite small (though somewhat greater than the differences observed between probit and regression). However, in 3 instances, the differences were fairly large and resulted in a change in the statistical significance of the impact estimates: hours of formal care at 6 and 12 months in the basic model and nursing home expenditures at 6 months in the basic model. The impact of channeling on formal care in the basic model went from essentially zero using the regression model to nearly 1 hour per week at 6 months (about 15 percent of the control group mean) using the Tobit model, with the latter being statistically significant at the .05 level. The same change in statistical significance occurred at 12 months for this outcome in the basic model, although the two estimates were not that different in magnitude. The effect on nursing home expenditures went in the opposite direction. The regression estimate was a reduction of 165 dollars (about 25 percent of the control group mean), which dropped to 47 dollars when Tobit was used.

TABLE IV.3. Impact Estimates from Least Squares Regression and from Tobit for Selected Truncated Outcome Measures
(t-statistics in parentheses)

| Outcome | Row | Basic Model: Regression | Basic Model: Tobit (a) | Financial Control: Regression | Financial Control: Tobit (a) | Sample Size |
|---|---|---|---|---|---|---|
| Hours of Formal Care, 6 months | impact | 0.14 (0.22) | 0.92* (2.00) | 5.35** (8.15) | 5.09** (9.99) | 4,974 |
| | control mean (b) | 6.4 | 6.2 | 4.8 | 6.3 | |
| Hours of Formal Care, 12 months | impact | 1.14 (1.78) | 1.46** (3.38) | 3.58** (5.56) | 3.39** (6.62) | 5,040 |
| | control mean | 5.2 | 4.8 | 4.5 | 6.0 | |
| Hours of Informal Care, 6 months | impact | -0.98 (-1.29) | -0.74 (-1.42) | -0.31 (-0.41) | -0.59 (-1.02) | 4,899 |
| | control mean | 6.02 | 6.27 | 6.31 | 7.08 | |
| Hours of Informal Care, 12 months | impact | -0.03 (-0.04) | 0.08 (0.21) | 0.07 (0.12) | -0.29 (-0.62) | 4,998 |
| | control mean | 3.69 | 3.96 | 4.56 | 5.15 | |
| Hospital Days, 6 months | impact | -0.35 (-0.41) | -0.59 (-0.83) | -0.71 (-0.83) | -0.00 (-0.01) | 5,554 |
| | control mean | 11.5 | 12.8 | 16.2 | 14.3 | |
| Hospital Days, 12 months | impact | -0.18 (-0.25) | -0.20 (-0.33) | -0.56 (-0.75) | -0.20 (-0.33) | 5,554 |
| | control mean | 7.0 | 8.1 | 9.0 | 8.6 | |
| Nursing Home Days, 6 months | impact | -2.36 (-1.93) | -0.59 (-0.67) | -1.14 (-0.94) | -0.27 (-0.33) | 4,593 |
| | control mean | 12.2 | 6.4 | 9.6 | 5.6 | |
| Nursing Home Days, 12 months | impact | -1.19 (-0.63) | -2.56 (-1.59) | -2.19 (-1.15) | -0.02 (-0.02) | 4,752 |
| | control mean | 16.3 | 12.8 | 16.7 | 10.1 | |
| Hospital Expenditures, 6 months | impact | -119 (-0.45) | -206 (-0.94) | -68 (-0.25) | 89 (0.36) | 5,554 |
| | control mean | 3,412 | 3,869 | 4,899 | 4,643 | |
| Hospital Expenditures, 12 months | impact | 59 (0.29) | -11 (-0.06) | -161 (-0.79) | -63 (-0.34) | 5,554 |
| | control mean | 2,015 | 2,307 | 2,706 | 2,641 | |
| Nursing Home Expenditures, 6 months | impact | -165* (-2.15) | -47 (-0.92) | -8 (-0.11) | 6 (0.12) | 4,593 |
| | control mean | 666 | 369 | 560 | 332 | |
| Nursing Home Expenditures, 12 months | impact | -58 (-0.56) | -120 (-1.42) | -103 (-0.99) | 1 (0.01) | 4,752 |
| | control mean | 819 | 657 | 894 | 546 | |

NOTE: Regression estimates and sample sizes do not in all cases correspond exactly with those presented in final channeling reports, because some changes may have taken place between the time that this analysis was conducted and the final analyses were completed.

  a. Estimates of channeling impacts were obtained from the Tobit coefficients by computing the predicted value of the outcome variable for treatments and controls (with all of the explanatory variables set at their overall sample means) and subtracting. Using the expression given by Moffitt and McDonald (1980) for the expected value of the dependent variable in a Tobit model, the estimated impact was:

$$
\text{Impact} = \left[ (\bar{X}b + a)\, F\!\left(\frac{\bar{X}b + a}{s}\right) + s\, f\!\left(\frac{\bar{X}b + a}{s}\right) \right] - \left[ \bar{X}b\, F\!\left(\frac{\bar{X}b}{s}\right) + s\, f\!\left(\frac{\bar{X}b}{s}\right) \right],
$$

    where $\bar{X}$ is the mean of the explanatory variables for the treatment and control groups combined; b and a are the estimated Tobit coefficients on the explanatory variables and treatment status indicators, respectively; s is the estimated standard error of the disturbance term in the Tobit model; f(.) is the standard normal density function; and F(.) is the cumulative distribution function of the standard normal (the predicted probability that the dependent variable is greater than zero). The standard error of the estimated impact was calculated using the usual formula for approximating the variance of a nonlinear combination of estimators (Kmenta, 1971, p. 444). The t-statistic (in parentheses) is simply the ratio of the estimated impact to the estimated standard error of the impact.

* Significantly different from zero at the .05 level.
** Significantly different from zero at the .01 level.
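The expected-value expression in the note to Table IV.3 can be sketched as follows. This is an illustration under stated assumptions, not the evaluation's own code: the function names are hypothetical, and the values used for the index at the means, the treatment coefficient a, and the disturbance standard error s are made up to show the mechanics.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density f(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function F(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_expected_value(index: float, s: float) -> float:
    """Unconditional expected outcome in a Tobit model censored at zero:
    E[Y] = index * F(index / s) + s * f(index / s)."""
    z = index / s
    return index * normal_cdf(z) + s * normal_pdf(z)


def tobit_impact(index_at_means: float, treatment_coef: float, s: float) -> float:
    """Expected outcome for treatments minus expected outcome for controls,
    both evaluated at the same mean values of the explanatory variables."""
    treated = tobit_expected_value(index_at_means + treatment_coef, s)
    control = tobit_expected_value(index_at_means, s)
    return treated - control


if __name__ == "__main__":
    # Hypothetical values: index at the means, treatment coefficient, and s.
    print(f"{tobit_impact(-20.0, 3.0, 25.0):.2f}")
```

When the index is well below zero, as it would be for an outcome that is zero for most of the sample, the impact on the expected outcome is much smaller than the coefficient a itself, because a is scaled by the probability of any use.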

Despite these differences, it was not clear that the Tobit procedure produced better estimates than regression even in these instances. The predicted nursing home expenditures for controls were far below the actual mean, suggesting that Tobit may not have provided reliable estimates. Furthermore, for both of the variables for which least squares and Tobit produced substantially different estimates, there was evidence that the Tobit estimates reflected the probability of any use of these services more strongly than the extent of use. Both of these problems were due to outliers, cases with extremely large values of the outcome variable, which affect Tobit estimates somewhat differently than least squares estimates. Although less sensitivity to outliers would be a desirable feature, the distorting effects of outliers on Tobit estimates may be even greater than their effects on least squares estimates, especially if there are treatment/control differences in the number of outliers. These potential problems, combined with the greater expense and difficulty of hypothesis testing with the Tobit model, again led us to prefer least squares regression as the estimation procedure, and to analyze the effects of outliers on these estimates directly.

2. The Effects of Outliers on Regression Estimates of Channeling Impacts

The effects of outliers (i.e., extremely large values of the outcome variable that are not simply data errors) on estimates of population means and regression coefficients are well-known, but there is much less documentation about what should be done when confronted by such problems. A common "solution", discarding the outliers, may distort estimates of program impacts more than leaving them in, since one of the effects of the program may be to reduce extreme use of or expenditures on services. This effect would be totally missed if outliers are discarded. However, it may be the case that differences between the two groups in the very small proportion of outliers could arise strictly by chance and affect the estimated treatment/control difference so greatly that it no longer provides a reliable estimate of channeling impacts.

Duan et al. (1983) cite examples of how even estimates that are unbiased can yield very misleading inferences about program impacts in cases where the outcome variable is zero for a substantial fraction of the sample but has extremely large values for a small fraction of the remaining cases. They then propose an alternative estimator for such situations. This procedure seemed potentially appropriate for the channeling evaluation, since several of the key outcome variables exhibit these characteristics, especially hospital and nursing home days and expenses.

The procedure advocated by Duan et al. is to break such service use variables (measured either in physical units or expenditures) into two separate variables: whether the service is used at all, and for those who use it, the amount of such services. The expected value of use is the product of the probability of use and the expected amount of use given that some occurred. Thus, a probit model is estimated first for whether any use occurred, as a function of treatment status and other explanatory variables. Then, using only observations that had some service use, a regression model is estimated to predict the amount of use (again dependent on treatment status and control variables), with the amount being expressed in logarithmic form to reduce the influence of outliers on the estimates. These two equations are then used to obtain predicted probabilities of use and amounts of use by service users for treatments and for controls with the same characteristics. These estimates in turn are used to compute overall expected use for the treatment and control groups and the difference between them.

This procedure was used on a set of key hospital and nursing home outcome variables with skewed distributions. Table IV.4 contains a comparison of the 2-part, least squares, and Tobit estimates of channeling impacts. The 2-part method yielded estimates that differed somewhat from the regression estimates, but not by enough to change the inference about whether channeling affected hospital and nursing home outcomes. The 2-part estimates were also generally closer to the least squares estimates than to the Tobit estimates, especially for the outcomes exhibiting the largest discrepancy between least squares and Tobit.

These results suggested that the more cumbersome 2-part method was not necessary, at least for hospital and nursing home outcomes, where outliers were most likely to occur. However, the results from the Tobit analysis suggested that estimates of channeling impacts on hours of formal care received at 6 months were also affected by outliers. To investigate this, the 2-part method was used for this outcome variable as well. In the financial control model, estimated impacts from least squares and the 2-part method were both large and statistically significant. In the basic model, however, the estimated impact from regression was small (0.14 hours) and not statistically significant, but the 2-part estimate was much larger (2.5 hours), and the impacts on both the probability of receiving care and the amount of care received by service recipients were statistically significant.

The nonsignificant effect on hours was unexpected because other estimates indicated that the basic model led to an increased proportion of sample members receiving any services. Thus, to have no effect on hours, channeling would have had to decrease the average amount of services received by those who would have received some services even in channeling's absence. Further examination of the data showed that the small regression estimate of treatment/control differences was heavily influenced by the receipt of continuous (24 hours per day) formal care by 7 control group members (representing 20 percent of total use by the 1,000 controls in the sample) but only 2 treatment group members. Use of the 2-part method dampened the effect of these outliers on the estimated treatment/control difference, and completely reversed the inference about channeling's effects on the average amount of care received by recipients. The estimate in column 7 of Table IV.4 indicates that treatment group recipients received significantly more hours of care (2.8 hours) than recipients in the control group.

TABLE IV.4. Comparison of Least Squares, Tobit, and 2-Part Estimates of Channeling Impacts for Skewed Outcome Variables

Columns 1 through 3 are the alternative estimates of impacts; column 4 is the control mean; columns 5 through 8 are the components of the 2-part method estimate.

6-Month Outcomes

| Outcome | Model | (1) Tobit | (2) Least Squares | (3) 2-Part (a) | (4) Control Mean | (5) Probability of Use: Impact | (6) Probability of Use: Control Mean | (7) Quantity for Users: Impact | (8) Quantity for Users: Control Mean | Sample Size |
|---|---|---|---|---|---|---|---|---|---|---|
| Hospital Days | Basic | -0.59 | -0.35 | -0.74 | 11.5 | -0.024 | 0.539 | -0.4 | 22.19 | 5,554 |
| | Financial Control | 0.00 | -0.71 | -0.77 | 16.2 | 0.018 | 0.546 | -2.3 | 29.03 | |
| Hospital Expenditures | Basic | -206 | -119 | -227 | $3,412 | -0.024 | 0.539 | -131 | 6,632 | 5,554 |
| | Financial Control | 89 | -68 | -178 | $4,889 | 0.018 | 0.546 | -596 | 8,813 | |
| Nursing Home Days | Basic | -0.59 | -2.36 | -2.42 | 12.2 | -0.004 | 0.113 | -19.2* | 81.30 | 4,593 |
| | Financial Control | -0.27 | -1.14 | -0.08 | 9.6 | 0.001 | 0.107 | -1.4 | 68.37 | |
| Nursing Home Expenditures | Basic | -47 | -165* | -131 | $666 | -0.004 | 0.113 | -1,035 | 4,521 | 4,593 |
| | Financial Control | 6 | -8 | -30 | $560 | -0.001 | 0.107 | -320 | 4,158 | |
| Hours of Formal Care | Basic | 0.92* | 0.14 | 2.50* | 6.50 | 0.074** | 0.400 | 2.82* | 16.24 | 4,974 |
| | Financial Control | 5.09** | 5.35** | 8.41** | 5.02 | 0.172** | 0.474 | 10.20** | 10.60 | |

12-Month Outcomes

| Outcome | Model | (1) Tobit | (2) Least Squares | (3) 2-Part (a) | (4) Control Mean | (5) Probability of Use: Impact | (6) Probability of Use: Control Mean | (7) Quantity for Users: Impact | (8) Quantity for Users: Control Mean | Sample Size |
|---|---|---|---|---|---|---|---|---|---|---|
| Hospital Days | Basic | -0.20 | -0.18 | 0.40 | 7.0 | -0.005 | 0.339 | 1.5 | 21.06 | 5,554 |
| | Financial Control | -0.20 | -0.56 | -0.44 | 9.0 | -0.0003 | 0.350 | -1.2 | 25.17 | |
| Hospital Expenditures | Basic | -11 | 59 | 139 | $2,015 | -0.005 | 0.339 | 506 | 6,079 | 5,554 |
| | Financial Control | -63 | -161 | -132 | $2,706 | -0.0003 | 0.350 | -370 | 7,597 | |
| Nursing Home Days | Basic | -2.56 | -1.19 | -0.78 | 16.3 | -0.025 | 0.129 | 19.3 | 111.41 | 4,752 |
| | Financial Control | -0.02 | -2.19 | -2.43 | 16.7 | 0.004 | 0.103 | -27.5 | 128.66 | |
| Nursing Home Expenditures | Basic | -120 | -58 | -4 | $819 | -0.025 | 0.129 | 1,345 | 5,757 | 4,752 |
| | Financial Control | 1 | -103 | -124 | $894 | 0.004 | 0.103 | -1,420 | 6,910 | |

  a. The impact estimate obtained from the 2-part method was calculated as follows:

    Impact = (proportion of control group with Y > 0 + estimated channeling impact on proportion)
    * (average value of Y for control group members with Y > 0 + estimated impact on Y for those with Y > 0)
    - (proportion of controls with Y > 0) * (average Y for controls with Y > 0),

    where Y is the value of the outcome variable examined. The impact on the proportion of sample members with Y > 0 was estimated from a probit model. The impact on outcomes for those with Y > 0 was estimated by first regressing the logarithm of the outcome variable on binary treatment indicators and the standard control variables, using only those cases with Y > 0. The coefficients (b) on the treatment status variables from this log regression were then used to calculate impacts on expenditures:

$$
\text{Impact on those with } Y > 0 = (e^{b} - 1) \times (\text{control group mean for those with } Y > 0).
$$

    The four components used to construct the overall impact are presented in columns 5 through 8 of this table.

* Significantly different from zero at the .05 level (2-tailed test).
** Significantly different from zero at the .01 level (2-tailed test).
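The arithmetic in the note to Table IV.4 can be checked against the table's own components. The sketch below uses hypothetical function names and plugs in the published components for two basic-model rows at 6 months; because those components are rounded, the result matches the reported 2-part estimates only approximately.

```python
import math


def two_part_impact(p_control: float, p_impact: float,
                    mean_users_control: float, amount_impact: float) -> float:
    """Overall impact from the 2-part method: expected use for treatments
    minus expected use for controls, where expected use is the probability
    of any use times the mean amount among users."""
    treated = (p_control + p_impact) * (mean_users_control + amount_impact)
    control = p_control * mean_users_control
    return treated - control


def amount_impact_from_log_coef(b: float, mean_users_control: float) -> float:
    """Convert the treatment coefficient b from the log regression into an
    impact in original units: (e^b - 1) times the control mean among users."""
    return (math.exp(b) - 1.0) * mean_users_control


if __name__ == "__main__":
    # Components from Table IV.4, basic model, 6 months.
    formal_care = two_part_impact(0.400, 0.074, 16.24, 2.82)
    hospital_days = two_part_impact(0.539, -0.024, 22.19, -0.4)
    print(f"Hours of formal care: {formal_care:.2f} (table reports 2.50)")
    print(f"Hospital days: {hospital_days:.2f} (table reports -0.74)")
```

For hours of formal care, most of the overall impact comes from the increase in the probability of receiving any care combined with the higher amount among recipients; neither component is dominated by the handful of cases receiving continuous care, which is why the 2-part estimate differs so much from the regression estimate of 0.14.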

Given the similarity of the 2-part estimates to the ordinary least squares regression estimates for nursing home and hospital days and expenditures, the final reports on these outcomes relied upon the ordinary regression results. This was done because the standard errors of impacts from the 2-part method are more cumbersome to calculate, and multivariate tests would be especially difficult to conduct. Even for hours of formal care, we chose in the final reports to rely on least squares estimates (computed both with and without the outliers), despite the fact that the 2-part method did yield estimates that were less sensitive to outliers than the ordinary least squares estimates. The reason for this decision was that if channeling did in fact reduce the service use of a small number of cases who would otherwise have used large amounts of services, the savings from such effects could be very substantial. The two-part method may understate the importance of such cases.

The 2-part method therefore may never give the most appropriate estimates. If important channeling effects occur for outliers, the two-part method may mask them. On the other hand, if treatment/control differences in outliers were due strictly to chance, the optimal approach is to drop them, rather than to just reduce their influence. Thus, throughout the evaluation, least squares regression was used to estimate channeling impacts. As shown in Table IV.4, this yields the same inferences about impacts on hospital and nursing home outcomes as the 2-part method. For formal care at 6 months, impacts were estimated in the final report with outliers included and then with them excluded. Evidence was presented indicating which estimates provided the most accurate indication of channeling impacts. (See Corson et al., 1986 for further discussion of those results.) No other outcome measures appeared to have skewed distributions; hence, no other analyses of the effects of outliers were conducted.
