Each evaluation must address several issues in determining an adequate sample size:
- Goals of the sample design, particularly the key outcomes to be measured
- Minimum precision standard for each goal
- Balance of the sample between the experimental or demonstration group and the control or comparison group
- Relative emphasis on overall impacts versus subgroup impacts
- Implications for sample size of a nonexperimental versus an experimental design

a. Outcomes to Be Measured

In Chapter II, we discussed the importance of narrowing or prioritizing the list of research questions that an evaluation is intended to answer. This is particularly important in sample design, since a sample that is designed to provide precise estimates of one outcome may be very weak for other outcomes. To build a sample that can answer the key research questions, it is important to determine the key outcome (or, at most, a handful of key outcomes) the evaluation is seeking to address, the level of variation in that outcome, and the expected magnitude of the impact on that outcome.
In most welfare reform evaluations, four key outcomes are the focus of the impact analysis: (1) the proportion of cases on cash assistance, (2) the mean benefit per case, (3) the proportion of cases with someone working, and (4) the mean earnings per case. Of these four outcomes, those that policymakers consider particularly important should be the focus of the sample design. If all four are of roughly equal importance (as often happens), the most conservative strategy is to focus on the outcome for which the relevant impact is likely to be hardest to detect (that is, the outcome that requires the largest sample to detect a statistically significant impact). The two factors that determine the ease of detecting an impact for a particular outcome are (1) the variance of the outcome (which affects the variance of the impact estimate), and (2) the likely magnitude of the impact.
Among the four outcomes, earnings is likely to have the largest variance relative to its mean, and thus to require the largest sample size to detect an impact of a given proportion; therefore, in many cases, samples are most conservatively designed to detect impacts on mean earnings. The likely magnitude of the impact also is important, however. In many past employment-training demonstrations, the proportionate impact on AFDC benefits tended to be smaller than the proportionate impact on earnings (Gueron and Pauly 1991). If a key goal is to be able to detect even a small impact on cash assistance benefit levels, that outcome may be the appropriate focus of the sample design.
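The role of outcome variance can be made concrete with the standard approximation for a two-group comparison of means, n per group ≈ 2(z_α + z_β)²σ²/δ². A minimal sketch in Python, where the participation rate, the earnings mean and standard deviation, and the impact sizes are illustrative assumptions rather than figures from the evaluations discussed:

```python
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect a difference of
    delta between two group means (two-sided test, equal group sizes)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value, two-sided test
    z_beta = z.inv_cdf(power)            # quantile for the desired power
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

# Illustrative (assumed) values:
# participation: 50% control rate, 5-point impact, sd = sqrt(.5 * .5)
n_participation = n_per_group(sigma=0.5, delta=0.05)
# earnings: mean $1,500/quarter, sd $3,000 (high variance), 10% impact
n_earnings = n_per_group(sigma=3000, delta=150)

print(round(n_participation))  # ~1570 cases per group
print(round(n_earnings))       # ~6279 cases per group
```

Because the earnings standard deviation is large relative to the detectable impact, the earnings outcome drives the sample size, which is why it is often the conservative design target.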
A sample well designed for assessing impacts on these key outcomes may be weak for assessing other types of impacts. For example, the terms and conditions have required many states to assess the impacts of welfare reform on Medicaid paid claims. Because Medicaid paid claims vary widely in the population (some individuals have very high medical costs, but most have low costs), even large average experimental-control differences in Medicaid claims may not be statistically significant with a sample designed primarily to estimate impacts on earnings.
Regression adjustment of impact estimates for baseline characteristics reduces the standard error of the impact estimates slightly (and thus, in principle, the sample size needed to detect a given difference). Of the random-assignment evaluations reviewed here, only the Minnesota MFIP evaluation took the role of regression adjustment into account in determining the desired sample size.
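The size of this effect can be approximated: if baseline covariates explain a share R² of the variance in the outcome, regression adjustment shrinks the variance of the impact estimate, and hence the required sample, by roughly a factor of (1 − R²). A sketch, with an assumed (illustrative) R²:

```python
def adjusted_n(n_unadjusted, r_squared):
    """Approximate sample size needed after regression adjustment,
    assuming covariates explain r_squared of the outcome variance."""
    return n_unadjusted * (1 - r_squared)

# If 3,000 cases are needed without adjustment and baseline
# characteristics explain 10% of outcome variance (an assumption):
print(adjusted_n(3000, 0.10))  # 2700.0
```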


b. Precision Standard

The needed sample size also depends on the level of precision at which the impact is to be measured. The precision standard for a sample design is determined by three factors: (1) the desired level of statistical significance for the impact estimate, (2) the power of the sample design (the probability of detecting the desired effect), and (3) whether a one-sided or a two-sided hypothesis test is used. A result is referred to as statistically significant if the probability of observing an estimated impact of that magnitude when the true impact is zero is very low: generally 10 percent or less (typical standards are 10 percent, 5 percent, or 1 percent). For an impact of a given size, the smaller the standard error, the more statistically significant the estimate; larger samples are thus required to detect an effect at the 1 percent level of significance than at the 5 percent level. The power of the design is the probability of detecting an effect, assuming an effect of a given size is present. For example, if the design has 80 percent power to detect a 5 percentage point impact at a 5 percent significance level, then, assuming the true impact of the program is 5 percentage points, the probability that a statistically significant impact will be observed is 80 percent. The larger the sample size, the higher the power of the sample to detect impacts of a given size and significance level.
Most evaluation research uses two-sided hypothesis tests, under the assumption that it is useful to distinguish effects in the desired or the unintended direction from policies with no effect. Bloom (1995) argued that one-sided tests may be adequate for most evaluations, since the key concern is to distinguish whether a policy had the desired effect. The advantage of one-sided tests is that smaller sample sizes are needed than in two-sided tests to achieve a given level of power and statistical significance.
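Since required sample size is proportional to (z_α + z_β)², the cost of a stricter significance level, and the saving from a one-sided test, can be compared directly. A sketch (the significance and power levels shown are conventional choices, not values from any particular evaluation):

```python
from statistics import NormalDist

def z_sum_sq(alpha, power, two_sided=True):
    """(z_alpha + z_beta)^2: the factor by which required sample size
    scales for a given significance level, power, and test type."""
    z = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    return (z.inv_cdf(1 - tail) + z.inv_cdf(power)) ** 2

base = z_sum_sq(alpha=0.05, power=0.80, two_sided=True)
one_sided = z_sum_sq(alpha=0.05, power=0.80, two_sided=False)
strict = z_sum_sq(alpha=0.01, power=0.80, two_sided=True)

print(round(one_sided / base, 2))  # ~0.79: one-sided needs ~21% fewer cases
print(round(strict / base, 2))     # ~1.49: 1% significance needs ~49% more
```

At 80 percent power and a 5 percent significance level, a one-sided test needs roughly a fifth fewer cases than a two-sided test, while tightening significance to 1 percent requires roughly half again as many.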


c. Sample Balance

Dividing the sample into equal numbers of experimental (demonstration) and control (comparison) cases (this is referred to as a "balanced" design) leads to estimates with the highest level of precision, for a given total sample size.(1)
However, substantial deviations from this balance may occur with only minor losses in precision (Bloom 1995). States may prefer an unbalanced sample because of a desire to implement the reform program as completely as possible (if the reforms are implemented statewide for all cases except control cases). By having the minimum allowed number of control cases but more experimental cases, states can increase sample precision while keeping the control group as small as possible. Thus, in many evaluations in which the intervention is implemented for everyone except the control group, the sample is designed to include two experimental group members for every control group member. Increasing the ratio of experimentals to controls beyond 2:1, for a fixed total sample size, leads to a more substantial loss in precision. Increasing the total sample size by adding additional experimental group members (but keeping the control group sample the same) increases precision only slightly.
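These precision trade-offs follow from the fact that the standard error of an experimental-control difference in means is proportional to sqrt(1/n_e + 1/n_c). A sketch with an illustrative total sample of 3,000 (the group sizes are assumptions for demonstration):

```python
import math

def rel_se(n_exp, n_ctrl):
    """Standard error of the impact estimate, up to a constant factor."""
    return math.sqrt(1 / n_exp + 1 / n_ctrl)

balanced = rel_se(1500, 1500)      # 1:1 split of 3,000 cases
two_to_one = rel_se(2000, 1000)    # 2:1 split, same total
three_to_one = rel_se(2250, 750)   # 3:1 split, same total

print(round(two_to_one / balanced, 3))    # ~1.061: 2:1 costs ~6% precision
print(round(three_to_one / balanced, 3))  # ~1.155: 3:1 costs ~15%

# Adding experimentals while holding the control group at 1,000:
print(round(rel_se(4000, 1000) / rel_se(2000, 1000), 3))  # ~0.913: small gain
```

The 2:1 design sacrifices only about 6 percent in precision, while pushing to 3:1 more than doubles that loss, and doubling the experimental group alone buys less than a 10 percent improvement.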


d. Trade-Offs Between Subgroup Analysis and Full-Sample Analysis

Oversampling of key subgroups allows the evaluation to obtain more precise estimates of program impacts for the subgroups of interest. However, such oversampling (if total sample size is held constant) also reduces the precision of the estimates of impacts on the full sample. This becomes less of a concern if there are enough resources for larger-than-minimum sample sizes overall, since the increase in precision from a larger sample will at least partly offset the loss in precision from stratification.
For example, suppose subgroups are defined as the individual demonstration sites. Samples may be allocated across the sites in three ways:
- No Stratification. If the population about which inferences are to be made is the caseload in the research sites only, sampling rates should be the same in all the sites, and the sample sizes in the sites should be proportional to the number of cases in those sites.
- Stratification to Increase Precision of Site-Level Impact Estimates. To make inferences about impacts in specific sites as well as in the entire group of research sites, sample sizes should be set to balance the precision needs of the two types of estimates. In general, cases in the smaller sites will be oversampled relative to cases in the larger sites. It still may be desirable to have larger samples in larger sites, however, to increase the precision of the overall estimates, as long as the samples in the smaller sites meet a minimum standard for site-level precision.
- Stratification to Increase State-Level Representativeness. If the population about which inferences are to be made is the entire state caseload, the sampling is appropriately conceived of as a two-stage process, in which sites are selected first and then cases within sites. Such a design could, in principle, lead to oversampling of either large or small sites. In this setting, implications for precision are most appropriately evaluated in the context of the state as a whole.
These same three approaches can be applied to determining sample sizes for other subgroups.
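A minimal sketch of the first two allocation approaches, using invented caseload figures (the site names, caseloads, total sample, and per-site floor are all hypothetical):

```python
def proportional(caseloads, total_n):
    """Allocate the sample in proportion to site caseloads (no stratification)."""
    total_cases = sum(caseloads.values())
    return {s: round(total_n * c / total_cases) for s, c in caseloads.items()}

def with_floor(caseloads, total_n, floor):
    """Proportional allocation, but guarantee each site at least `floor`
    cases, oversampling small sites for site-level estimates (simple sketch)."""
    alloc = proportional(caseloads, total_n)
    shortfall = sum(max(0, floor - n) for n in alloc.values())
    biggest = max(alloc, key=alloc.get)  # take the difference from the largest site
    alloc = {s: max(n, floor) for s, n in alloc.items()}
    alloc[biggest] -= shortfall
    return alloc

sites = {"A": 8000, "B": 1500, "C": 500}  # hypothetical site caseloads
print(proportional(sites, 2000))      # {'A': 1600, 'B': 300, 'C': 100}
print(with_floor(sites, 2000, 250))   # {'A': 1450, 'B': 300, 'C': 250}
```

Raising site C to the 250-case floor improves its site-level precision at a modest cost to the overall estimate, mirroring the balancing act described above.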


e. Nonexperimental Versus Experimental Design Requirements

In general, nonexperimental designs require larger samples than experimental designs for a given outcome measure. For example, suppose a design compares applicants to a welfare reform program in some counties with applicants to the current program in other counties. Suppose also that differences (other than the welfare reform program) between the demonstration and comparison groups could be completely controlled for using measured background characteristics. Even in this case, for a given sample size, the standard error of the regression-adjusted impact estimate would be larger than in an experimental evaluation because of correlations between the welfare reform site indicator and the background variables in the equation. Intuitively, the more strongly variables are correlated (tend to move together), the larger the sample required to "sift out" their separate effects: in this case, to separate the impact of the program from the effects of other characteristics.(2)
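This penalty can be expressed through the variance inflation factor: if the background variables predict demonstration-site membership with a squared multiple correlation R², the variance of the adjusted impact estimate, and hence the required sample, grows by a factor of 1/(1 − R²). (Under random assignment, this R² is zero in expectation.) The R² values below are illustrative assumptions:

```python
def required_n_multiplier(r_squared):
    """Variance inflation factor: how much the sample must grow when the
    treatment indicator is predictable from the covariates with R^2."""
    return 1 / (1 - r_squared)

print(required_n_multiplier(0.0))            # 1.0 -- experiment: no inflation
print(round(required_n_multiplier(0.3), 2))  # 1.43
print(round(required_n_multiplier(0.5), 2))  # 2.0 -- double the sample
```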
The difficulty of sorting out program impacts from other factors is magnified if there are unobserved differences between the demonstration and comparison groups ("selection bias"). In the best of circumstances, these differences may be adjusted for using two-equation models.(3) In many such models, the first equation predicts membership in the treatment (demonstration) group (as a function of individual or site characteristics). The second equation estimates the effects of the program using predicted treatment status from the first equation rather than actual treatment status. Such models typically produce very imprecise impact estimates and therefore require much larger sample sizes to detect impacts of a given magnitude (Burghardt et al. 1985).
In a nonexperimental evaluation, however, it may be possible to limit the population of interest to those most likely to be affected by the reforms, so that the impact to be detected is easier to measure. For instance, many of the current waiver evaluations include provisions that affect program eligibility at initial application. States are thus required to randomly assign all AFDC applicants to an experimental or control group. A concern is that the applicant sample includes many applicants who would be denied AFDC benefits under both the new and old versions of the program (and who thus "dilute" estimates of program impacts). A nonexperimental design that compared only approved applicants under the old and new programs would be examining populations with much higher levels of AFDC participation. Thus, assuming the differences between the two groups could be adequately controlled (a big assumption), it would need smaller samples to detect given percentage impacts on participation.
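The dilution argument can be quantified: if only a fraction f of the sample can be affected by a provision, the impact measured over the whole sample is roughly f times the impact among those affected, so the required sample grows by a factor of about 1/f². The approval rate and impact size below are assumed figures:

```python
def dilution_multiplier(affected_fraction):
    """Factor by which required sample size grows when only a fraction
    of the sample can respond to the policy change."""
    return 1 / affected_fraction ** 2

# If 60% of applicants would be approved under either program (assumed),
# a 5-point impact among the approved appears as a 3-point impact
# in the full applicant sample:
print(round(0.60 * 5, 1))                   # 3.0 percentage points
print(round(dilution_multiplier(0.60), 2))  # ~2.78x the sample
```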
