(1)The variance of the impact estimate is inversely related to T(1-T), where T is the proportion of the sample in the treatment group. All else equal, this variance is smallest when T = .5.
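The relationship in this note can be illustrated with a short sketch. It assumes the simplest case, a difference in means with equal outcome variance in both groups, where the variance of the impact estimate is proportional to 1 / (T(1-T)). The function names are hypothetical and are not taken from any of the evaluations cited here.

```python
def variance_factor(t: float) -> float:
    """Return 1 / (T * (1 - T)); the impact-estimate variance is proportional to this."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must be strictly between 0 and 1")
    return 1.0 / (t * (1.0 - t))


def relative_variance(t: float) -> float:
    """Variance at treatment share t, relative to a balanced 50/50 design."""
    return variance_factor(t) / variance_factor(0.5)


if __name__ == "__main__":
    # Compare a few allocations against the balanced design.
    for share in (0.5, 0.4, 0.25, 0.1):
        print(share, round(relative_variance(share), 3))
```

Under these assumptions, moving from a 50/50 split to a 25/75 split raises the variance by about one-third for the same total sample.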

(2)In fact, in a comparison site design with two sites, it is impossible to distinguish the effects of the program from the effects of other site-specific factors that do not vary within the site. If a characteristic varies among individuals in a site, that variation can be used to identify its separate effect. Similarly, the larger the number of sites, the greater the ability to separate other site effects from the effects of the intervention.

(3)Again, such unobserved factors must vary among individuals as well as across sites. Such models also require that a variable exists that predicts participation in the demonstration well (in a comparison site design, this may be equivalent to predicting where people live), but does not otherwise affect the program impact (known technically as an "identifying" variable). Otherwise, if the control variables in both the participation and outcome equations are the same, the predicted value of the participation variable will be either perfectly or highly correlated with the other variables in the outcome equation (since it is a function of those variables). The models also typically make restrictive assumptions about the distribution of the error terms. In referring to the "best of circumstances," we mean that lack of precision remains a concern in these models even when good identifying variables are available and the distributional assumptions are reasonable. Nonexperimental models are discussed further in Chapter VI.

(4)For example, if the federal requirement is for 2,000 approved control group applicants, but only 2 out of 3 applicants are approved, a sample of 3,000 control group applicants may be needed to satisfy the requirement. In an intervention that does not affect eligibility, a similar requirement implies a sample of only 2,000.
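The arithmetic in this note can be written as a small helper. This is only a sketch: the function name is hypothetical, and it assumes the approval rate is known in advance and applies uniformly to the control group. Exact fractions are used so that a rate such as 2/3 does not introduce rounding error.

```python
import math
from fractions import Fraction


def applicants_needed(required_approved: int, approval_rate: Fraction) -> int:
    """Applicants to sample so that the expected number approved meets the requirement."""
    if not 0 < approval_rate <= 1:
        raise ValueError("approval_rate must be in (0, 1]")
    # Divide the required approved cases by the approval rate, rounding up.
    return math.ceil(Fraction(required_approved) / approval_rate)


if __name__ == "__main__":
    # The example from the note: 2 out of 3 applicants are approved.
    print(applicants_needed(2000, Fraction(2, 3)))
    # An intervention that does not affect eligibility (all sampled cases count).
    print(applicants_needed(2000, Fraction(1)))
```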

(5)The effect size is a way to standardize analysis of statistical power over different types of outcomes measured on different scales. A sample size is chosen to detect a certain effect size (for example, an impact equal to 10 percent of the standard deviation of the outcome) with, say, 80 percent power at the 5 percent significance level (95 percent confidence). The same sample size would be needed to detect a given effect size, regardless of the outcome measure.
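A minimal sketch of the calculation described in this note follows. It uses the standard normal-approximation formula for a difference in means and assumes a two-sided test and equal outcome variance in the two groups; the evaluations discussed here may have used one-sided tests or other refinements. The function name and defaults are hypothetical.

```python
import math
from statistics import NormalDist


def total_sample_size(effect_size: float, alpha: float = 0.05,
                      power: float = 0.80, treat_share: float = 0.5) -> int:
    """Total cases (treatment plus control) needed to detect `effect_size`.

    effect_size is the impact divided by the standard deviation of the outcome,
    so the result does not depend on the scale of the outcome measure.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not 0 < treat_share < 1:
        raise ValueError("treat_share must be strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = (z_alpha + z_power) ** 2 / (treat_share * (1 - treat_share) * effect_size ** 2)
    return math.ceil(n)


if __name__ == "__main__":
    # Effect size of 10 percent of a standard deviation, 80 percent power.
    print(total_sample_size(0.10))
```

Because the required sample grows with the inverse square of the effect size, halving the effect size to be detected roughly quadruples the sample.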

(6)MAXIMUS, "Evaluation of the Work Not Welfare Demonstration: Evaluation Plan," pp. III-21-22 and Exhibit III-6.

(7)California Department of Social Services, "APDP Approval Case Sampling Plan," Attachment 1.

(8)Services offered to experimental cases are the same as those offered to most cases; some counties did not even tell caseworkers which cases were in the experimental group.

(9)Colorado Department of Social Services, "Sampling Plan to Implement the Colorado Personal Responsibility and Employment Program," p. 3.

(10)Methods for analysis of recidivism and similar outcomes are discussed in Chapter VI.

(11)MDRC, "Proposed Design and Work Plan for Evaluating the Minnesota Family Investment Program," p. 14.

(12)Technically, the design effect is the ratio of the standard error of an estimate from a complex sample design (for example, a design with oversampling of particular strata) to the standard error of an estimate from a simple random sample of the same size.
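To make the definition concrete, the sketch below converts a design effect into an effective sample size. It follows the definition in this note, under which the design effect is a ratio of standard errors; many sampling texts instead define the design effect as the ratio of variances, which is the square of that quantity. It also assumes standard errors shrink with the square root of the sample size. The function name is hypothetical.

```python
def effective_sample_size(n: int, se_ratio: float) -> float:
    """Simple-random-sample size that gives the same precision as a complex sample of n.

    se_ratio is the design effect as defined in the note: the standard error under
    the complex design divided by the standard error under simple random sampling.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if se_ratio <= 0:
        raise ValueError("se_ratio must be positive")
    # Variance scales with the square of the standard error, so divide by se_ratio squared.
    return n / se_ratio ** 2


if __name__ == "__main__":
    # Illustrative values only: 2,000 cases with standard errors inflated by 25 percent.
    print(effective_sample_size(2000, 1.25))
```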

(13)Wisconsin sampled the full caseload in both demonstration counties and also may include the full caseload in the comparison counties.

(14)The basic sample in Minnesota is a proportional sample of recipients and applicants; the evaluators also oversample new applicants (defined only as applicants who had not been on AFDC for at least three years). The sampling rates in the urban counties were 13 percent for single-parent recipients, 80 to 86 percent for single-parent applicants, and 46 to 53 percent for two-parent applicants and recipients (Knox et al. 1995).

(15)The MFIP evaluation is the exception; it reports a higher-than-expected intake of new applicants and reapplicants. It is not clear whether this indicates an effect of the demonstration or other factors. MDRC responded by cutting the sampling rates or the intake periods for several subgroups.

(16)Bloom (1995) provides a formula for making this adjustment, based on the R-squared expected for the regression equation.

(17)Strictly speaking, active cases in the research sample should be representative of the full active caseload in the state. However, DHHS generally has been willing to assume the sampled sites are representative of the caseload (see Section C).

(18)Weighting schemes become more complicated if sampling rates are changed over time, perhaps because sample intake has been lower than expected, or if weights must also be used for some other purpose (such as adjusting for oversampling of sites or subgroups).
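As a baseline for the complications described in this note, the simplest weighting scheme gives each case a weight equal to the inverse of the sampling rate under which it was selected. The sketch below assumes each case carries the rate that applied when it was sampled, so a rate that changes over time or across subgroups is handled case by case. The function names and the example values are hypothetical.

```python
from typing import Sequence


def sampling_weight(sampling_rate: float) -> float:
    """Inverse-probability weight for a case selected at the given sampling rate."""
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    return 1.0 / sampling_rate


def weighted_mean(values: Sequence[float], rates: Sequence[float]) -> float:
    """Weighted mean of outcomes, weighting each case by the inverse of its sampling rate."""
    if len(values) != len(rates) or not values:
        raise ValueError("values and rates must be non-empty and of equal length")
    weights = [sampling_weight(r) for r in rates]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


if __name__ == "__main__":
    # Two illustrative cases: one sampled at 50 percent, one at 25 percent.
    print(weighted_mean([1.0, 3.0], [0.5, 0.25]))
```

If weights are also needed for another purpose, such as adjusting for oversampled sites, one common approach is to multiply the separate weights together, provided the selection stages are independent.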

(19)The implications of entry effects for the analysis are discussed in Chapter VI.

(20)The Upjohn Institute, as a consultant to the state, developed a model to select clusters of counties with approximately 1,500 cases each. It used 49 variables to describe each county and selected clusters with the goals of maximizing generalizability to all rural counties in the state and of having pairs of clusters that were well matched. In addition, it restricted the model so that no cluster could contain more than one county that was not interested in participating.