DHHS staff typically specify sample sizes for welfare reform evaluations in the waiver terms and conditions, after detailed discussions with the state. These sample size requirements vary according to the state's evaluation objectives and the size of the population being studied. The usual minimum requirements, however, are for the control group to include 2,000 recipient cases and 2,000 approved applicant cases and for the experimental group to be one to two times as large. States may exceed these minimum requirements to improve the precision of their estimates. Usually, sample size requirements do not include specific sample size goals for subgroups. States are required to sample all applicants (not just approved applicants) if the intervention affects eligibility for AFDC, but the sample size requirement is still generally phrased in terms of approved applicants. Thus, the federal requirements generally imply larger sample size requirements when the intervention affects eligibility.(4) Despite this federal guidance, the five evaluations reviewed for this study varied greatly in their planned sample sizes (overall and for key subgroups), as well as in the goals, assumptions, and precision standards used to justify these sample sizes.

a. Planned Sample Sizes

Table III.1 summarizes planned sample sizes in the five evaluations. We first review these planned samples and the data, assumptions, and precision standards used to justify them; later, we discuss how well actual sampling experience has accorded with plans.

TABLE III.1 PLANNED SAMPLE SIZES IN FIVE STATE WAIVER EVALUATIONS
Wisconsin. Wisconsin's WNW (the only nonexperimental evaluation) had a planned sample size of 4,000 cases in the demonstration counties and at least 4,000 in the comparison counties (for the part of the evaluation based on a comparison county design). The sample of 4,000 in the demonstration counties was expected to consist of 1,000 recipient cases (the full caseload in those counties) and 3,000 applicants (all applicants over a seven-year period).
The evaluation plan prepared by MAXIMUS discusses the adequacy of the sample size in the WNW evaluation in terms of Cohen's "effect size" measure, defined as the impact on an outcome divided by the standard deviation of the outcome (Cohen 1977).(5) A table shows the sample size needed to detect various effect sizes for one-sided tests with a .05 significance level, at levels of power ranging from 50 to 99 percent. The text notes that a sample of 4,000 each in the demonstration and comparison groups is more than adequate to detect the smallest effect size shown (.10, or 10 percent of the standard deviation of the outcome) at the highest level of power.(6) Without more information, however, it is difficult to assess whether an effect size of .10 is realistic for the outcomes being considered. Furthermore, the evaluation plan does not discuss whether the sample is sufficient for the applicant and recipient samples considered separately. The evaluation plan also mentions, as one way to add precision to the estimates, possibly increasing the comparison group sample to the full caseload in the comparison counties for outcomes easily measured in administrative data. The effects of the evaluation's nonexperimental design on the precision of the estimates are not considered.
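The kind of table in the WNW plan can be reproduced, approximately, with the standard normal-approximation formula for a two-group comparison. This is our sketch, not the MAXIMUS calculation itself; the function name is ours, and the z-values are textbook constants for a one-sided .05 test and 99 percent power.

```python
# Sketch of the sample-size calculation behind the WNW plan, using the
# normal approximation. z-values are hard-coded textbook constants.

def n_per_group(effect_size, z_alpha, z_beta):
    """Sample size per group needed to detect a standardized effect
    (Cohen's d) in a two-group comparison of means."""
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

# One-sided test at the .05 level (z = 1.645) with 99 percent power
# (z = 2.326), for the smallest effect size shown in the plan (.10).
n = n_per_group(0.10, 1.645, 2.326)
print(round(n))  # about 3,154 per group, so 4,000 each is indeed sufficient
```

Under these assumptions, roughly 3,150 cases per group suffice, consistent with the plan's conclusion that 4,000 each is more than adequate for an effect size of .10 at the highest power level shown.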
California. In California, the required sample size was 15,000 recipient cases (5,000 controls and 10,000 experimental group members). The required sample size for the approved applicant sample was specified as the sample over four years assuming that applicants are sampled using the same sampling rates as used for the recipient cases. The estimated sample of applicants outlined in the sampling plan was 17,280, consisting of 11,520 experimental cases and 5,760 controls.(7) Although we have not found any explicit analysis of precision in the California materials, the large overall sample appears to have been intended to permit subgroup analyses (see Section A.2.b).
Among the five state evaluations, only California planned on unbalanced sample sizes for the two research groups, with two experimental group members for every control group member. Because the demonstration counties had caseloads much larger than twice the control group sample, including additional experimental cases was more feasible than it would have been with smaller sites.(8) The larger experimental group improves the precision of the estimates.
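The precision gain from the larger experimental group can be seen from the variance of a difference in means, which is proportional to 1/n1 + 1/n2. A minimal sketch, comparing California's 2:1 allocation with the federal minimum balanced design (the function name is ours):

```python
import math

def se_factor(n_treat, n_control):
    """Relative standard error of a treatment-control difference in means
    (proportional to sqrt(1/n_treat + 1/n_control))."""
    return math.sqrt(1.0 / n_treat + 1.0 / n_control)

balanced   = se_factor(5_000, 5_000)    # federal minimum balanced design
unbalanced = se_factor(10_000, 5_000)   # California's 2:1 allocation
print(round(1 - unbalanced / balanced, 3))  # about a 13 percent smaller SE
```

Doubling the experimental group while holding the control group at 5,000 shrinks the standard error of the impact estimate by roughly 13 percent relative to the 5,000/5,000 minimum, though by less than adding the same cases in a balanced way would.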
Colorado. In Colorado, the terms and conditions require the following samples: (1) recipients, 2,000 experimental and 2,000 control cases; and (2) approved applicants, 2,000 experimental and 2,000 control cases. The planned sample sizes described in the evaluation plan are: (1) recipients, 2,034 experimental and 2,034 control cases; and (2) approved applicants, 3,288 experimental and 3,288 control cases. The planned applicant sample was larger than required because the Colorado staff interpreted the sample size requirements in the terms and conditions as referring to the number of cases active two years after implementation. The Colorado sampling plan analyzes precision in terms of the minimum sample sizes needed for county-level estimates but assumes applicant and recipient cases will be pooled for analysis. It does not make clear the need for county-level precision or the rationale for pooling applicant and recipient cases (pooling is discussed further in Section B). The stated precision standard for the analysis is 95 percent power for a one-tailed test; this standard is applied to an assumed reduction in recidivism to welfare from 30 to 15 percent.(9) The power requirement of 95 percent is higher than that typically used in evaluation research (80 percent is more common). In addition, recidivism to welfare is not really an appropriate outcome on which to base the power analysis, since it can be defined only for a nonrandom portion of the sample (cases that have already exited AFDC).(10)
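The Colorado materials do not report the sample size implied by these assumptions, but the standard two-proportion formula gives a sense of why the assumed halving of recidivism is a very forgiving benchmark. A sketch under those stated assumptions (the function name is ours; z-values are textbook constants):

```python
def n_per_group_props(p1, p2, z_alpha, z_beta):
    """Per-group sample size to distinguish two proportions
    (normal approximation, unpooled variances)."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# One-tailed .05 test (z = 1.645) with 95 percent power (z = 1.645),
# for the assumed drop in recidivism from 30 to 15 percent.
print(round(n_per_group_props(0.30, 0.15, 1.645, 1.645)))
```

Only a few hundred cases per group are needed to detect an effect that large, which underscores the report's point: the stringent-sounding 95 percent power standard is undemanding when paired with an implausibly large assumed impact.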
Michigan. In the Michigan TSMF evaluation, the planned sample size was 21,952 cases: 13,578 recipient and 8,374 applicant cases, evenly divided between experimental and control group members. The Abt proposal shows that this total sample is adequate to detect a 5 percent impact on earnings under the following assumptions: mean monthly earnings of $165 for controls, with a standard deviation of $244 (based on "a recent study of welfare recipients"), and a precision standard of a 5 percent significance level for a one-tailed test with 80 percent power. This calculation assumes no increase in variance due to stratification and ignores any reduction from regression adjustment of the impact estimates. Again, the assumption seems to have been that applicants and recipients would be pooled.
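The Abt figures can be checked, approximately, with the usual difference-in-means formula. This is our reconstruction under the stated assumptions, not Abt's own computation (the function name is ours; z-values are textbook constants):

```python
def n_per_group_means(impact, sd, z_alpha, z_beta):
    """Per-group sample size to detect a given difference in means
    (normal approximation, equal variances)."""
    return 2 * (sd * (z_alpha + z_beta) / impact) ** 2

impact = 0.05 * 165  # a 5 percent impact on mean monthly earnings of $165
# One-tailed .05 test (z = 1.645) with 80 percent power (z = 0.842).
n = n_per_group_means(impact, 244, 1.645, 0.842)
print(round(2 * n))  # total sample of roughly 21,600, near the planned 21,952
```

The implied total of roughly 21,600 cases is close to the planned 21,952, so the Abt claim is internally consistent with its stated assumptions.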
Minnesota. The MFIP demonstration has four experimental groups and multiple strata, which substantially complicates the relevant power calculations (see Tables III.1 and III.2). Table III.2 presents the full design for the Minnesota sample. Probably because of the complex design of the demonstration, the terms and conditions of the MFIP evaluation have an explicit precision standard, unlike those in the other evaluations that we have reviewed. The terms and conditions state that samples must be adequate to detect experimental-control differences in major outcomes equal to 20 percent of the standard deviation of the outcome at a 5 percent significance level with 80 percent power.

TABLE III.2 PLANNED SAMPLE SIZES FOR THE MINNESOTA MFIP EVALUATION, BY SUBGROUP
The MFIP evaluation design report argues that the proposed MFIP sample design can meet this standard in comparisons of any two experimental groups with 2,000 cases each, using the employment rate as the key outcome, a two-tailed test, an assumed mean of 50 percent employed in the control group (which the authors say is consistent with other MDRC studies), and assumed gains from regression adjustment of the impact estimate equivalent to a regression equation with an R-squared of .08 (which they also say is consistent with experience).(11) This calculation assumes pooling of applicant and recipient cases and no increases in variance (often referred to as "design effects") due to stratification of the sample.(12) The two smaller research groups (E2 and C2) have roughly 2,000 cases each, but E2 is stratified by county. The larger groups (E1 and C1) are well above that level, but they were stratified by urban/rural location and (within these groups) into several other subgroups, with different sampling rates for the different subgroups (see the next subsection). The larger samples in groups E1 and C1 (over 6,000 in each) may balance or outweigh any design effects from stratification.
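A quick way to see that 2,000 cases per group meets the 20-percent-of-a-standard-deviation standard is to compute the minimum detectable effect (MDE) implied by those assumptions. A sketch under the stated assumptions, ignoring design effects (the function name is ours; z-values are textbook constants):

```python
import math

def mde_sd_units(n_per_group, z_alpha, z_beta, r_squared=0.0):
    """Minimum detectable effect in standard-deviation units for a
    two-group comparison, after regression adjustment with the given
    R-squared, ignoring any design effects from stratification."""
    return (z_alpha + z_beta) * math.sqrt(2 * (1 - r_squared) / n_per_group)

# Two-tailed .05 test (z = 1.96), 80 percent power (z = 0.842), R-squared = .08.
print(round(mde_sd_units(2_000, 1.96, 0.842, 0.08), 3))
```

Under these assumptions the MDE is well inside the required .20 of a standard deviation, which is consistent with the design report's claim; whether it stays inside once design effects from stratification are counted is the open question the report raises.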


b. Subgroup Sample Sizes

Other than stratification of the sample between applicants and recipients (discussed in Section B), the only explicit stratifications of the sample in the five studies examined were by site (or grouping of sites, such as urban versus rural) and by single-parent versus two-parent cases. The motivation behind these stratifications generally was to allow more precise estimates for subgroups; the implications for precision of the estimates for subgroups and overall were not explicitly drawn out.
All of the evaluations (except Wisconsin, which is not really comparable because of its quasi-experimental design) to some extent oversampled cases in smaller sites. In three instances, the motivation was to increase the precision for subgroup estimates; in one instance, it was to increase statewide representativeness:(13)
- In California, the sample was allocated across counties as follows: 40 percent to Los Angeles (roughly proportional to its relative caseload) and 20 percent each to the remaining three counties of Alameda, San Bernardino, and San Joaquin. This allocation substantially oversamples San Joaquin in particular. The goal of being able to measure site-specific impacts justified this approach.
- Colorado sampled at a higher rate in smaller sites and sampled the full caseload in the smallest county included in the study. The goal of a minimum of 330 experimental and 330 control group cases in each county determined the sampling rates, with additional cases from the largest counties selected to meet the overall sample size goals (and to improve overall precision).
- In Michigan, the sample was selected from four offices: two in Wayne County (Detroit) and two in other parts of the state. The entire caseload in the two non-Wayne offices was assigned to the research sample, but only 70 percent of the caseload in the two Wayne County offices was assigned to it. The motivation for this allocation appears to have been to make the sample more representative of the state as a whole, since the proportion of the sample from Wayne County thus resembled the proportion of the state caseload from Wayne County.
- Minnesota had an explicit stratification into urban versus rural sites: the full caseload was sampled in the rural sites, but not in the urban sites. The motivation for this allocation was to derive separate estimates for urban versus rural areas.
Two of the evaluations reviewed stratified explicitly by single-parent versus two-parent cases. California set up the sample so that one-third of the cases sampled were two-parent (AFDC-UP) cases, although such cases typically make up less than 15 percent of the caseload. Minnesota also explicitly oversampled two-parent cases (including cases on the state general assistance program and AFDC-UP), relative to its basic sample of single-parent cases in urban areas.(14) Again, no explicit power analyses were offered to justify these sample sizes, but the motivation was clearly to increase the precision of estimates for two-parent cases. This stratification seems sensible, since changes in rules for two-parent families were a major part of the reform packages in these states, and both states had relatively large sample sizes.
None of the evaluators appears to have considered the effects of oversampling of sites or other subgroups on the precision of the estimates for the overall research sample.
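The overall-precision cost of oversampling can be approximated with Kish's design-effect formula, deff = 1 + cv^2, where cv is the coefficient of variation of the sampling weights. A minimal sketch with purely hypothetical weights (the function name and the illustrative 3:1 sampling-rate ratio are ours, not drawn from any of the five evaluations):

```python
def kish_design_effect(weights):
    """Kish's approximate design effect from unequal weights:
    deff = 1 + (coefficient of variation of the weights) squared."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    return 1 + var / mean ** 2

# Hypothetical: half the sites sampled at triple the rate of the other half,
# so the undersampled half carries analysis weight 3 relative to weight 1.
weights = [1.0] * 500 + [3.0] * 500
print(kish_design_effect(weights))  # 1.25: effective sample is n / 1.25
```

In this hypothetical, oversampling shrinks the effective overall sample by a fifth, which is the kind of statewide-precision cost the evaluators do not appear to have quantified.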


c. Planned Versus Actual Samples

The discussion so far has been of planned sample sizes in the five state waiver evaluations reviewed. At this time, it is apparent that actual samples in several of the states are not as large as planned.(15) This problem is discussed further in the next section; here, we note only that not meeting sample goals can seriously reduce the usefulness of an evaluation.
