In Chapter II, we discussed the importance of narrowing or prioritizing the list of research questions that an evaluation is intended to answer. This is particularly important in sample design, since a sample that is designed to provide precise estimates of one outcome may be very weak for other outcomes. To build a sample that can answer the key research questions, it is important to determine the key outcome (or, at most, a handful of key outcomes) the evaluation is seeking to address, the level of variation in that outcome, and the expected magnitude of the impact on that outcome.

In most welfare reform evaluations, four key outcomes are the focus of the impact analysis: (1) the proportion of cases on cash assistance, (2) the mean benefit per case, (3) the proportion of cases with someone working, and (4) the mean earnings per case. Of these four outcomes, those that policymakers consider particularly important should be the focus of the sample design. If all four are of roughly equal importance (as often happens), the most conservative strategy is to focus on the outcome for which the relevant impact is likely to be hardest to detect (that is, the outcome that requires the largest sample to detect a statistically significant impact). The two factors that determine the ease of detecting an impact for a particular outcome are (1) the variance of the outcome (which affects the variance of the impact estimate), and (2) the likely magnitude of the impact.
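How these two factors interact can be sketched with the standard minimum detectable impact calculation for a comparison of two group means. The sketch below is illustrative only and is not drawn from any of the evaluations reviewed here. It assumes equal-sized experimental and control groups, a common standard deviation, a two-sided test, and a normal approximation; the function name and the default significance level and power are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_impact(sd, n_per_group, alpha=0.05, power=0.80):
    """Smallest true difference in means detectable with the given power.

    Assumptions (illustrative, not from the source): equal group sizes,
    a common standard deviation `sd`, a two-sided test at level `alpha`,
    and a normal approximation to the sampling distribution.
    """
    z = NormalDist()
    # Critical value for the test plus the quantile for the desired power.
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Standard error of the difference between two independent means.
    standard_error = sd * sqrt(2 / n_per_group)
    return multiplier * standard_error
```

Under these assumptions, the detectable impact rises in proportion to the outcome's standard deviation and falls only with the square root of the sample size, so halving the detectable impact requires roughly four times the sample.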

Among the four outcomes, earnings is likely to have the largest variance relative to its mean, and thus to require the largest sample to detect an impact of a given proportionate size; therefore, in many cases, samples are most conservatively designed to detect impacts on mean earnings. The likely magnitude of the impact is also important, however. In many past employment-training demonstrations, the proportionate impact on AFDC benefits tended to be smaller than the proportionate impact on earnings (Gueron and Pauly 1991). If a key goal is to be able to detect even a small impact on cash assistance benefit levels, that outcome may be the appropriate focus of the sample design.
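The role of variance relative to the mean can be made concrete by writing the required sample size in terms of the coefficient of variation (the standard deviation divided by the mean) and the proportionate impact. The following is a minimal sketch under the same simplifying assumptions as a standard two-group comparison (equal group sizes, two-sided test, normal approximation). The function names are hypothetical, and any input values a reader supplies would need to come from the state's own baseline data.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(cv, proportionate_impact, alpha=0.05, power=0.80):
    """Approximate cases needed in each group to detect a proportionate impact.

    `cv` is the outcome's coefficient of variation (sd / mean) and
    `proportionate_impact` is the impact as a fraction of the mean.
    Assumes equal group sizes and a normal approximation (illustrative).
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # n = 2 * (multiplier * sd / impact)^2, with sd and impact both
    # expressed as fractions of the mean.
    return ceil(2 * (multiplier * cv / proportionate_impact) ** 2)


def cv_of_proportion(p):
    """Coefficient of variation of a 0/1 outcome with mean p."""
    return sqrt((1 - p) / p)
```

For example, comparing `n_per_group(cv_earnings, 0.10)` with `n_per_group(cv_of_proportion(p_employed), 0.10)` shows how much larger a sample the earnings outcome would require for the same proportionate impact, where `cv_earnings` and `p_employed` are placeholders for values estimated from baseline data. A smaller expected proportionate impact, as for benefit levels, raises the required sample by the square of the ratio.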

A sample well designed for assessing impacts on these key outcomes may be weak for assessing other types of impacts. For example, the terms and conditions have required many states to assess the impacts of welfare reform on Medicaid paid claims. Because Medicaid paid claims vary widely across the population (some individuals have very high medical costs, but most have low costs), even large average experimental-control differences in Medicaid claims may not be statistically significant in a sample designed primarily to estimate impacts on earnings.

Regression adjustment of impact estimates for baseline characteristics slightly reduces the standard error of the impact estimates (and thus, in principle, the sample size needed to detect a given difference). Of the random-assignment evaluations reviewed here, only the Minnesota MFIP evaluation took into account the role of regression adjustment in determining desired sample size.
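The size of this reduction can be approximated with a standard result, stated here as an illustration rather than as the method used in any of the evaluations reviewed. The notation is assumed: $R^2$ is the share of the outcome's variance explained by the baseline characteristics, $\hat{\Delta}$ is the impact estimate, and $n$ is the required sample size. The approximation ignores the degrees of freedom used by the covariates.

```latex
\operatorname{Var}\!\left(\hat{\Delta}_{\text{adjusted}}\right)
  \approx \left(1 - R^2\right)\,
  \operatorname{Var}\!\left(\hat{\Delta}_{\text{unadjusted}}\right),
\qquad
n_{\text{adjusted}} \approx \left(1 - R^2\right)\, n_{\text{unadjusted}}
```

When baseline characteristics explain only a small share of the variance in the outcome, $R^2$ is small and the reduction in required sample size is correspondingly modest.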