a. Planned Sample Sizes


Table III.1 (Planned Sample Sizes in Five State Waiver Evaluations) summarizes the planned sample sizes in the five evaluations. We first review these planned samples and the data, assumptions, and precision standards used to justify them; later, we discuss how well actual sampling experience has accorded with plans.

Wisconsin. Wisconsin's WNW (the only nonexperimental evaluation) had a planned sample size of 4,000 cases in the demonstration counties and at least 4,000 in the comparison counties (for the part of the evaluation based on a comparison county design). The sample of 4,000 in the demonstration counties was expected to consist of 1,000 recipient cases (the full caseload in those counties) and 3,000 applicants (all applicants over a seven-year period).

The evaluation plan prepared by MAXIMUS discusses the adequacy of the sample size in the WNW evaluation in terms of Cohen's "effect size" measure, defined as the impact on an outcome divided by the standard deviation of the outcome (Cohen 1977).(5) A table shows the sample size needed to detect various effect sizes for one-sided tests with a .05 significance level, at levels of power ranging from 50 to 99 percent. The text notes that a sample of 4,000 each in the demonstration and comparison groups is more than adequate to detect the smallest effect size shown (.10, or 10 percent of the standard deviation of the outcome) at the highest level of power.(6) Without more information about the outcomes being considered, however, it is difficult to assess whether an effect size of .10 is realistic. Furthermore, the evaluation plan does not discuss whether the sample is sufficient for the applicant and recipient samples considered separately. As one way to add precision to the estimates, the evaluation plan also mentions possibly expanding the comparison group sample to the full caseload in the comparison counties for outcomes easily measured in administrative data. The effects of the evaluation's nonexperimental design on the precision of the estimates are not considered.
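As a rough illustration of the calculation behind such a table (our own sketch, not taken from the MAXIMUS plan), the per-group sample size for a two-sample comparison can be approximated from standard normal quantiles:

```python
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.80, one_sided=True):
    """Approximate per-group n for a two-sample comparison of means:
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2."""
    z_alpha = norm.ppf(1 - alpha) if one_sided else norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

# The smallest effect size shown (.10) at the highest power level (99 percent):
print(round(n_per_group(0.10, alpha=0.05, power=0.99)))  # ~3,154 per group
```

Under this approximation, detecting an effect size of .10 at 99 percent power requires roughly 3,150 cases per group, so samples of 4,000 each are indeed more than adequate.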

California. In California, the required sample size was 15,000 recipient cases (5,000 controls and 10,000 experimental group members). The required sample size for the approved applicant sample was specified as the number of applicants accumulated over four years, assuming applicants are sampled at the same rates as recipient cases. The estimated applicant sample outlined in the sampling plan was 17,280, consisting of 11,520 experimental cases and 5,760 controls.(7) Although we have not found any explicit analysis of precision in the California materials, the large overall sample appears to have been intended to permit subgroup analyses (see Section A.2.b).

Among the five state evaluations, only California planned unbalanced sample sizes for the two research groups, with two experimental group members for every control group member. Because the demonstration counties had caseloads much larger than twice the control group sample, including additional experimental cases was more feasible than it would have been with smaller sites.(8) Relative to a balanced design with the same number of controls, the larger experimental group improves the precision of the impact estimates, although it is slightly less precise than a balanced design of the same total size.
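To see the effect of the allocation (our own illustration, using California's planned recipient sample), note that the standard error of a simple experimental-control difference in means is proportional to sqrt(1/n_E + 1/n_C):

```python
import math

def se_factor(n_e, n_c):
    """Relative standard error of a difference in means: sqrt(1/n_E + 1/n_C)."""
    return math.sqrt(1 / n_e + 1 / n_c)

# California's planned 2:1 recipient sample versus two alternatives:
print(se_factor(10_000, 5_000))  # 2:1 split as planned        -> ~0.0173
print(se_factor(5_000, 5_000))   # 1:1 with the same controls  -> ~0.0200
print(se_factor(7_500, 7_500))   # 1:1 with the same total     -> ~0.0163
```

Adding 5,000 experimental cases to a 5,000/5,000 design shrinks the standard error by about 13 percent; splitting the same 15,000 cases evenly would shrink it by about 18 percent.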

Colorado. In Colorado, the terms and conditions require the following samples: (1) recipients--2,000 experimental and 2,000 control cases, and (2) approved applicants--2,000 experimental and 2,000 control cases. The planned sample sizes described in the evaluation plan are: (1) recipients--2,034 experimental and 2,034 control cases, and (2) approved applicants--3,288 experimental and 3,288 control cases. The planned applicant sample was larger than required because the Colorado staff interpreted the sample size requirements in the terms and conditions as referring to the number of cases active two years after implementation. The Colorado sampling plan analyzes precision in terms of the minimum sample sizes needed for county-level estimates but assumes applicant and recipient cases will be pooled for analysis. It does not make clear the need for county-level precision or the rationale for pooling applicant and recipient cases (pooling is discussed further in Section B). The stated precision standard for the analysis is 95 percent power for a one-tailed test; this standard is applied to an assumed reduction in recidivism to welfare from 30 to 15 percent.(9) The power requirement of 95 percent is higher than that typically used in evaluation research (80 percent is more common). In addition, recidivism to welfare is not an appropriate outcome measure on which to base the power analysis, since it can be defined only for a nonrandom portion of the sample (cases that have already exited AFDC).(10)
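As a rough check on what this standard implies (our own sketch, not drawn from the Colorado sampling plan), the normal approximation for comparing two proportions gives the per-group sample needed to detect a drop in recidivism from 30 to 15 percent:

```python
from math import sqrt
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.95):
    """Normal-approximation per-group n for a one-tailed test of two proportions."""
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p1 - p2) ** 2

# Recidivism falling from 30 to 15 percent, one-tailed, 95 percent power:
print(round(n_per_group(0.30, 0.15)))  # ~165 cases per group
```

Only about 165 cases per group would satisfy the stated standard, but those must be cases that have already exited AFDC; the full research sample needed to generate them is much larger and, as noted, nonrandomly selected.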

Michigan. In the Michigan TSMF evaluation, the planned sample size was 21,952--13,578 recipient and 8,374 applicant cases, evenly divided between experimental and control group members. The Abt proposal shows that this total sample is adequate to detect a 5 percent impact on earnings under the following assumptions: mean monthly earnings of $165 for controls, with a standard deviation of $244 (based on "a recent study of welfare recipients"), and a precision standard of 80 percent power for a one-tailed test at the 5 percent significance level. The calculation assumes no increases in variance due to stratification and ignores any reductions from regression adjustment of the impact estimates. Again, the assumption seems to have been that applicants and recipients would be pooled.
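A back-of-the-envelope check (our own sketch under the stated assumptions) confirms that the Abt figures are roughly consistent:

```python
from scipy.stats import norm

# Stated assumptions: control mean of $165 per month, standard deviation of
# $244, one-tailed test at the 5 percent level, 80 percent power, no
# stratification design effects or regression adjustment.
impact = 0.05 * 165            # a 5 percent earnings impact: $8.25 per month
effect_size = impact / 244     # ~0.034 standard deviations
z_sum = norm.ppf(0.95) + norm.ppf(0.80)
n_per_group = 2 * (z_sum / effect_size) ** 2
print(round(2 * n_per_group))  # total of ~21,632
```

The implied total of roughly 21,600 cases closely matches the planned sample of 21,952.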

Minnesota. The MFIP demonstration has four experimental groups and multiple strata; this substantially complicates the relevant power calculations (see Tables III.1 and III.2). Table III.2 presents the full design for the Minnesota sample. Probably because of the complex design of the demonstration, the terms and conditions of the MFIP evaluation have an explicit precision standard, unlike those in the other evaluations that we have reviewed. The terms and conditions state that samples must be adequate to detect experimental-control differences in major outcomes equal to 20 percent of the standard deviation of the outcome at a 5 percent significance level with 80 percent power.

The MFIP evaluation design report argues that the proposed MFIP sample design can meet this standard in comparisons of any two experimental groups with 2,000 cases each, using the employment rate as the key outcome, a two-tailed test, an assumed mean of 50 percent employed in the control group (which the authors say is consistent with other MDRC studies), and assumed gains from regression adjustment of the impact estimate equivalent to a regression equation with an R-squared equal to .08 (which they also say is consistent with experience).(11) This calculation assumes pooling of applicant and recipient cases, and no increases in variance (often referred to as "design effects") due to stratification of the sample.(12) The two smaller research groups (E2 and C2) are roughly 2,000 cases each, but E2 is stratified by county. The larger groups (E1 and C1) are well above that level, but they were stratified by urban/rural location and (within these groups) into several other subgroups, with different sampling rates for the different subgroups (see the next subsection). The larger samples in groups E1 and C1 (over 6,000 in each) may balance or outweigh any design effects from stratification.
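A rough version of this calculation (our own sketch under the stated assumptions, ignoring any stratification design effects) expresses the minimum detectable effect in standard deviation units:

```python
from math import sqrt
from scipy.stats import norm

# Two groups of 2,000 cases each: two-tailed test at the 5 percent level,
# 80 percent power, regression adjustment with R-squared = .08.
n_e = n_c = 2_000
z_sum = norm.ppf(0.975) + norm.ppf(0.80)  # ~1.96 + ~0.84
mde = z_sum * sqrt(1 / n_e + 1 / n_c) * sqrt(1 - 0.08)
print(round(mde, 3))  # ~0.085 standard deviations
```

By this approximation, groups of 2,000 can detect effects well below the 0.20 standard deviation threshold, which leaves a margin for the design effects the report does not quantify.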