Approaches to Evaluating Welfare Reform: Lessons from Five State Demonstrations

e. Nonexperimental Versus Experimental Design Requirements


In general, nonexperimental designs require larger samples than experimental designs for a given outcome measure. For example, suppose a design compares applicants to a welfare reform program in some counties with applicants to the current program in other counties. Suppose also that differences (other than the welfare reform program) between the demonstration and comparison groups could be completely controlled for using measured background characteristics. Even in this case, for a given sample size, the standard error of the regression-adjusted impact estimate would be larger than in an experimental evaluation because of correlations between the welfare reform site indicator and the background variables in the equation. Intuitively, the greater the extent to which variables are correlated (tend to move together), the larger the sample required to "sift out" their separate effects--in this case, to separate the impact of the program from the effects of other characteristics.(2)
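This precision loss can be illustrated with a small simulation (not drawn from the report; all numbers are invented). When the site indicator is correlated with a background covariate--as when demonstration and comparison counties differ--the standard error of the estimated program impact is inflated relative to a randomized design of the same size:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

def se_treatment(corr):
    """Standard error of the treatment coefficient in an OLS regression
    of the outcome on treatment status and one background covariate,
    when treatment assignment has correlation `corr` with the covariate."""
    x = rng.normal(size=n)                                   # background characteristic
    latent = corr * x + np.sqrt(1 - corr**2) * rng.normal(size=n)
    t = (latent > 0).astype(float)                           # site/treatment indicator
    y = 2.0 * t + 1.5 * x + rng.normal(size=n)               # outcome; true impact = 2.0
    X = np.column_stack([np.ones(n), t, x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)             # OLS fit
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)                    # sampling covariance of beta
    return np.sqrt(cov[1, 1])                                # SE of treatment coefficient

se_experimental = se_treatment(0.0)    # random assignment: t uncorrelated with x
se_nonexperimental = se_treatment(0.8) # site-based assignment: t correlated with x
print(se_experimental, se_nonexperimental)
```

With the same sample size, the second standard error is larger, so a nonexperimental design would need more cases to detect an impact of the same magnitude.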

The difficulty of sorting out program impacts from other factors is magnified if there are unobserved differences between the demonstration and comparison groups ("selection bias"). In the best of circumstances, these differences may be adjusted for using two-equation models.(3) In many such models, the first equation predicts membership in the treatment (demonstration) group (as a function of individual or site characteristics). The second equation estimates the effects of the program using predicted treatment status from the first equation rather than actual treatment status. Such models typically produce very imprecise impact estimates and therefore require much larger sample sizes to detect impacts of a given magnitude (Burghardt et al. 1985).
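The two-equation idea can be sketched as follows (a stylized instrumental-variables example, not the specific models used in the studies cited; the variables are hypothetical). An unobserved factor drives both program entry and the outcome, so a naive comparison of participants and nonparticipants is biased; predicting treatment status from a site characteristic removes the bias, but the resulting estimate is much less precise:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
z = rng.binomial(1, 0.5, n)        # site characteristic used to predict treatment
u = rng.normal(size=n)             # unobserved factor ("selection bias")
# Equation 1 in spirit: treatment depends on the site characteristic and on u
t = ((0.8 * z + u + rng.normal(size=n)) > 0.5).astype(float)
y = 1.0 * t + 1.0 * u + rng.normal(size=n)   # true program impact = 1.0

# Naive OLS of y on t is biased upward because u raises both t and y
naive = np.cov(t, y, ddof=0)[0, 1] / np.var(t)

# Equation 2 in spirit: replace actual treatment with treatment predicted from z
t_hat = np.where(z == 1, t[z == 1].mean(), t[z == 0].mean())
two_equation = np.cov(t_hat, y, ddof=0)[0, 1] / np.var(t_hat)
print(naive, two_equation)
```

The two-equation estimate is centered near the true impact of 1.0, while the naive estimate is not; but because predicted treatment status varies much less than actual treatment status, the corrected estimate has a far larger sampling error, which is why such designs demand bigger samples.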

In a nonexperimental evaluation, however, it may be possible to limit the population of interest to those most likely to be affected by the reforms, so that the impact to be detected is larger and easier to measure. For instance, many of the current waiver evaluations include provisions that affect program eligibility at initial application. States are thus required to randomly assign all AFDC applicants to an experimental or control group. A concern is that this applicant sample includes many people who would be denied AFDC benefits under both the new and old versions of the program (and who thus "dilute" estimates of program impacts). A nonexperimental design that compared only approved applicants under the old and new programs would examine populations with much higher levels of AFDC participation. Thus, assuming the differences between the two groups could be adequately controlled--a big assumption--it would need smaller samples to detect given percentage impacts on participation.
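The sample-size logic can be made concrete with a standard two-proportion power calculation (the participation rates below are invented for illustration). The same percentage impact corresponds to a larger absolute difference when baseline participation is higher, so restricting attention to approved applicants reduces the required sample:

```python
from statistics import NormalDist

def n_per_group(p_base, relative_impact, alpha=0.05, power=0.8):
    """Approximate sample size per group to detect a given relative
    (percentage) impact on a participation rate, using a two-sided
    two-proportion test at significance level alpha with the given power."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    d = p_base * relative_impact       # absolute difference to detect
    p_bar = p_base - d / 2             # average of the two rates
    return 2 * p_bar * (1 - p_bar) * z**2 / d**2

# Full applicant pool: participation diluted by applicants who would be
# denied under either program (hypothetical baseline rate)
n_all = n_per_group(p_base=0.50, relative_impact=0.05)

# Approved applicants only: much higher baseline participation
n_approved = n_per_group(p_base=0.90, relative_impact=0.05)
print(round(n_all), round(n_approved))
```

Under these illustrative rates, detecting the same 5 percent relative impact requires several times as many cases in the diluted applicant pool as among approved applicants.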