Approaches to Evaluating Welfare Reform: Lessons from Five State Demonstrations

10/01/1996

BACKGROUND AND OBJECTIVES

In recent years, most states have taken advantage of provisions in federal law that allowed them to obtain waivers from Aid to Families with Dependent Children (AFDC) and Food Stamp Program rules in order to implement welfare reform--between 1993 and 1996, the Clinton administration approved waivers for 43 states. The federal waiver process required rigorous evaluations of all welfare reform demonstrations and provided matching funds for the evaluations. Federal welfare reform legislation, passed in August 1996, replaces the AFDC program with block grants to the states. As states design and implement their new programs, innovation will continue. The need for information on program impacts and the relative effectiveness of various approaches will remain strong.

The Department of Health and Human Services (DHHS) sought to pull together information on the approaches used in the evaluations of state welfare reforms undertaken as part of the waiver process. The goals were (1) to obtain an overview of common issues the evaluations have faced, approaches that have worked well, and approaches that have not; and (2) to provide general information on good evaluation practices that will be useful to states as they plan evaluations of their new programs. DHHS asked Mathematica Policy Research, Inc. (MPR) to undertake this review of evaluations.

To meet these objectives, we reviewed evaluations in five states and consulted with a panel of evaluation experts. The five demonstrations whose evaluations we reviewed are (1) the California Assistance Payments and Work Pays Demonstration Projects, (2) the Colorado Personal Responsibility and Employment Program, (3) To Strengthen Michigan Families, (4) the Minnesota Family Investment Program, and (5) the Wisconsin Work Not Welfare demonstration. Four of the five demonstrations used an experimental evaluation design; one relied on a quasi-experimental design. The five evaluations also differed in many other respects. This report reviews their approaches, with particular emphasis on the analysis of the impacts of welfare reform. We review and present recommendations concerning selected issues in five areas: (1) choice between an experimental and quasi-experimental evaluation design, (2) sample design, (3) implementation of experimental evaluations, (4) data collection, and (5) analysis methods.

EVALUATION DESIGN

Two major types of evaluation designs have been used in welfare reform evaluations: (1) experimental designs, which involve random assignment of cases to an experimental group subject to welfare reform or to a control group subject to pre-reform policies; and (2) quasi-experimental designs, which compare cases subject to welfare reform policies to a comparison group subject to pre-reform policies that is separated from the demonstration group by space or time.

Experimental Designs

Most evaluations of state welfare reform demonstrations use an experimental design. DHHS has preferred experimental designs because they have strong internal validity--that is, they produce unbiased estimates of program impacts for the population subject to random assignment. In contrast, several studies that compared experimental estimates to nonexperimental estimates based on the same data have shown that nonexperimental approaches tend to produce biased estimates. Although experimental designs are thus generally the preferred approach, they have several limitations:

  • Lack of external validity (the ability to generalize the results beyond the specific sites or types of cases included in the experiment) is common.
  • Implementation of random assignment is often imperfect. Cases may cross over between the treatment and control groups, and control group policies may be affected by the reform or otherwise change from the pre-reform program.
  • Experiments cannot be used to measure entry effects (effects of the program on rates of application for assistance), since random assignment occurs only after application.
  • Experiments cannot capture community effects, since measuring community-wide effects requires comparing communities with and without the reform program for all cases.

Quasi-Experimental Designs

Only one waiver evaluation (Wisconsin's Work Not Welfare demonstration) was approved with a quasi-experimental design, but this design may become more common in a block grant environment. There are three major types: (1) pre-post designs (which use a comparison group from an earlier period in the same sites as the demonstration), (2) comparison site designs (which use a comparison group from other sites similar in characteristics to the demonstration sites), and (3) combinations of the two (which use both pre-reform and post-reform data from the demonstration and comparison sites). Quasi-experimental designs are stronger if pre-post and comparison site designs are combined, if there is random assignment of sites to demonstration or comparison status, and if there are many sites. Still, some differences between the demonstration and comparison groups (other than the intervention) may persist in a quasi-experimental design, leading to biased estimates of program impacts.
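To illustrate the combined pre-post and comparison site design, the sketch below computes a simple difference-in-differences estimate: the change in the mean outcome in the demonstration sites minus the change in the comparison sites. This is a standard way to use both pre-reform and post-reform data, not a method drawn from the evaluations reviewed. The function name and the earnings figures are hypothetical, and the estimate is unbiased only under the assumption that the two groups of sites would have followed the same trend without the reform.

```python
from statistics import mean


def difference_in_differences(demo_pre, demo_post, comp_pre, comp_post):
    """Change in demonstration sites minus change in comparison sites."""
    demo_change = mean(demo_post) - mean(demo_pre)
    comp_change = mean(comp_post) - mean(comp_pre)
    return demo_change - comp_change


# Hypothetical monthly earnings (dollars) for illustration only.
demo_pre = [400, 420, 380, 400]    # demonstration sites, before reform
demo_post = [470, 450, 460, 460]   # demonstration sites, after reform
comp_pre = [410, 390, 400, 400]    # comparison sites, same earlier period
comp_post = [430, 420, 410, 420]   # comparison sites, same later period

estimate = difference_in_differences(demo_pre, demo_post, comp_pre, comp_post)
print(f"Difference-in-differences estimate: {estimate}")
```

Subtracting the comparison site change removes trends common to both sets of sites, but it does not remove differences that arise only in the demonstration sites for reasons other than the reform.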

SAMPLE DESIGN

Sample design--formal analysis of the populations to be studied, the sample sizes needed to make reliable inferences, and the sampling procedures to be used--deserves more attention in welfare reform evaluations.

Ensuring Adequate Sample Sizes

To ensure that an adequate sample size is being used to address the research questions of interest, evaluation planners should specify the key outcomes that the sample is designed to measure (and the expected level of variation in those outcomes), the precision standards for impact estimates, and the minimum impact the sample should be able to detect. DHHS could provide technical assistance to states concerning appropriate sample design assumptions, based on a review of relevant studies.
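As a hedged illustration of how these inputs fit together, the sketch below computes a minimum detectable impact for a simple difference in means between the experimental and control groups, using the standard normal approximation. It is not taken from the evaluations reviewed. The function name, the 5 percent two-sided significance level, the 80 percent power standard, and the example figures are all illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_impact(sd, n_experimental, n_control,
                              alpha=0.05, power=0.80):
    """Smallest true impact detectable with the given power (two-sided test).

    Assumes a simple difference in means, equal outcome variance in both
    groups, and no clustering or regression adjustment.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    standard_error = sd * sqrt(1 / n_experimental + 1 / n_control)
    return multiplier * standard_error


# Hypothetical example: a binary employment indicator with a control group
# mean of 0.40, so its standard deviation is sqrt(0.40 * 0.60).
sd_employment = sqrt(0.40 * 0.60)
mdi = minimum_detectable_impact(sd_employment, 2000, 2000)
print(f"Minimum detectable impact: {mdi:.3f}")
```

Because the standard error falls with the square root of the sample size, halving the minimum detectable impact requires roughly four times as many cases.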

Role of Applicant and Recipient Samples

Applicant and recipient samples are each of interest in their own right. Impacts on applicants provide the best measures of long-term program effects; impacts on recipients indicate effects on a more disadvantaged group. The sample should be designed so that each of these subsamples can support separate impact estimates. Sampling rates for applicant cases should be high enough to finish sampling within two years, to ensure an adequate follow-up period, and to minimize the risks of changes in the program or policy environment. In setting sampling rates for applicants, it is important to take into account the fact that those who reapply or transfer from a nonresearch site should not be subject to random assignment. To ensure sufficient sample sizes over a short intake period, applicants generally need to be sampled at higher rates than recipient cases.

Site Selection

If the goal of the evaluation is to generalize results to the state as a whole, care should be taken to select representative sites to the extent possible. Sites may be selected through a formal sampling process or selected judgmentally to represent specific characteristics. In either case, it is important to assess how the sample ultimately selected compares to the statewide caseload in background characteristics and (if reform is implemented statewide) in outcomes. Even if sites are selected through a formal sampling process, it is important to recognize that a sample concentrated in a few sites may provide very imprecise state-level estimates.
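The loss of precision from concentrating the sample in a few sites can be approximated with the standard design effect for cluster samples, 1 + (m - 1) * rho, where m is the number of cases per site and rho is the intraclass correlation of the outcome within sites. The sketch below applies that formula. It assumes equal-sized sites selected as clusters, and the value of rho and the sample sizes are hypothetical.

```python
def design_effect(cases_per_site, intraclass_correlation):
    """Variance inflation from clustering, assuming equal-sized sites."""
    return 1 + (cases_per_site - 1) * intraclass_correlation


def effective_sample_size(total_cases, number_of_sites, intraclass_correlation):
    """Size of a simple random sample with the same precision."""
    cases_per_site = total_cases / number_of_sites
    return total_cases / design_effect(cases_per_site, intraclass_correlation)


# Hypothetical comparison: the same 3,000 cases spread over 6 or 60 sites,
# with an assumed intraclass correlation of 0.02.
concentrated = effective_sample_size(3000, 6, 0.02)
dispersed = effective_sample_size(3000, 60, 0.02)
print(f"Effective sample size with 6 sites:  {concentrated:.0f}")
print(f"Effective sample size with 60 sites: {dispersed:.0f}")
```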

IMPLEMENTATION OF EXPERIMENTAL EVALUATIONS

Four aspects of the implementation of an experimental evaluation require special care if the results are to be meaningful: (1) the timing of random assignment, (2) the method of random assignment, (3) ensuring that control group policies remain unchanged, and (4) preventing experimental and control group cases from changing status.

Timing of Random Assignment

The timing of random assignment for both recipient and applicant cases is an important concern.

  • Recipient Cases. Random assignment of recipient cases at the time of redetermination of eligibility (which usually occurs every six months) has the advantage that cases can be informed in person of the policies that apply to them; this promotes awareness and understanding of policy changes and minimizes confusion. When random assignment coincides with redetermination, however, the recipient sample will only include cases that remain on the rolls through that point; thus, all short-term cases will be omitted. Assigning cases gradually as they come up for redetermination also slows down implementation of the reforms relative to random assignment of all existing cases at the same time.
  • Applicants. If reforms substantially affect eligibility, it is appropriate to do random assignment of applicants before determining eligibility and, then, to determine eligibility under the relevant program rules. However, this approach has the disadvantage of including some in the applicant sample who would be ineligible under either set of rules and, thus, who are not affected by the reforms. (These cases could be excluded by determining eligibility under both sets of rules and dropping from the sample those cases ineligible under both. However, a dual eligibility determination often is judged too costly.)

Method of Random Assignment

Assigning cases using a computer-generated random number is the preferred approach. Most methods of random assignment are acceptable, however, if the selection process is not subject to manipulation by either program staff or clients. If an existing number (such as a social security number) is used in the selection rule, it is important that clients not become aware of the rule; otherwise, they may change their behavior.
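A minimal sketch of assignment by computer-generated random number follows. The function name, case identifiers, and seed value are hypothetical. The sketch assumes that the draw is made by software outside the control of program staff and clients and that the seed is recorded so the assignment can be audited later.

```python
import random


def assign_cases(case_ids, experimental_share=0.5, seed=None):
    """Assign each case to 'experimental' or 'control' by a random draw.

    Each case is assigned independently, so group sizes will match the
    target share only approximately.
    """
    rng = random.Random(seed)
    assignments = {}
    for case_id in case_ids:
        draw = rng.random()  # uniform on [0, 1)
        if draw < experimental_share:
            assignments[case_id] = "experimental"
        else:
            assignments[case_id] = "control"
    return assignments


# Hypothetical case identifiers; the seed is recorded for later audit.
assignments = assign_cases(["case-001", "case-002", "case-003", "case-004"],
                           seed=1996)
for case_id, group in assignments.items():
    print(case_id, group)
```

Unlike a rule based on an existing number such as the social security number, the outcome of the draw cannot be predicted from anything the client or caseworker knows in advance.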

Ensuring that Control Group Policies Remain Unchanged

It is often difficult to keep the program and policies available to the control group the same as they were before welfare reform. Sometimes there is spillover of new services (such as job placement services) from the experimental to the control group, and sometimes services that should be available to the control group are displaced because the experimental group is given priority. In addition, publicity about the reform may affect control group members. States can minimize such problems by having different (but equivalent) staff members work with control and experimental group members, by reminding control group members that they are not subject to reform policies, and by maintaining the appropriate level of resources for the control group.

Preventing Control and Experimental Cases from Changing Status

Evaluation planners should be aware of the risk of crossover between the experimental and control groups. Crossover typically occurs because cases migrate to a county with a different policy, because cases split or merge, or because of administrative errors. It is important to monitor the level of crossover and to design information systems that minimize crossover (for example, by ensuring that both halves of a split case retain the original experimental status).

DATA COLLECTION

Baseline Data

In quasi-experimental evaluations, it is absolutely critical to collect good baseline data in order to control for any differences in background between the demonstration and comparison groups. Baseline data are also important in experimental evaluations, where they are used to define subgroups and to assess potential problems with random assignment. In both types of evaluations, pre-program data on outcomes are particularly useful. Administrative records on these outcomes that are maintained over time in a consistent format permit an extensive pre-program history to be assembled at relatively low cost. Other needed information is best collected on a short form filled out by case heads at the time of random assignment, with help from a program staff member. Key information includes a full listing of family members and their demographic characteristics, identifiers such as social security numbers, and contact information to help locate sample members for follow-up interviews.
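One common way to use baseline data to assess potential problems with random assignment is to compare the experimental and control groups on each baseline characteristic. The sketch below computes a standardized difference in means for one characteristic. The function name and the ages shown are hypothetical, and the choice of this particular statistic is our assumption rather than a practice reported by the evaluations reviewed.

```python
from math import sqrt
from statistics import mean, variance


def standardized_difference(experimental_values, control_values):
    """Difference in group means, in units of the pooled standard deviation."""
    pooled_sd = sqrt((variance(experimental_values) +
                      variance(control_values)) / 2)
    return (mean(experimental_values) - mean(control_values)) / pooled_sd


# Hypothetical ages of case heads recorded at random assignment.
experimental_ages = [25, 30, 35, 40]
control_ages = [26, 30, 34, 42]

smd = standardized_difference(experimental_ages, control_ages)
print(f"Standardized difference in age: {smd:.3f}")
```

Values near zero across many characteristics are consistent with a properly functioning assignment process; large or systematic differences warrant a closer look at how cases were assigned.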

Role of Follow-Up Surveys

Since most of the outcomes likely to be affected by welfare reform may be tracked in administrative records, we recommend that follow-up surveys be used judiciously. In general, surveys should address questions that administrative data cannot answer, their goals should be clearly and narrowly defined, and the sample should be designed to address these goals efficiently. Coordination to see that similar issues are approached in a similar manner across states (for example, by developing a standard question series for a particular issue) would be very useful. In addition, evaluation planners should consider less readily available types of administrative data (such as vital records) as alternatives to surveys.

Achieving High Response Rates

In follow-up surveys, high response rates are critical to ensure that the survey data produce unbiased impact estimates. We recommend that surveys be designed to achieve response rates of 80 percent or above for surveys within 18 months of random assignment and 70 percent or above for longer-term follow-ups. Achieving such response rates typically requires (1) obtaining contact information at the time of random assignment, (2) updating contact information at least every 18 months (even if no interview is scheduled), (3) using on-line databases (such as credit bureau and motor vehicle records) to locate respondents, (4) mixed-mode interviewing (telephone interviews with in-person follow-ups), and (5) paying respondents for their time. If a survey lacks one or two of these practices or is being conducted by a firm without specific experience in interviewing low-income populations, the survey may be at risk of low response rates.

ANALYSIS METHODS

Evaluation approaches varied substantially for the following four types of analyses.

Distinguishing Impacts of Specific Components of Reform Packages

Experimental evaluation designs with a single experimental group are well suited to measuring the impact of the full package of reforms but are not well suited to analyzing the impact of specific reforms within a package. In some circumstances, it may be feasible to examine the impacts of subsets of a reform package by using multiple experimental groups, with one group assigned to the full package of reforms and one or more groups assigned to subsets of the package. If multiple experimental groups are not feasible, a careful study of program implementation often provides qualitative insight into which provisions of the reform are most effective; theoretical considerations may also suggest which provisions have the strongest effects. However, nonexperimental modeling of the impacts of specific components typically produces imprecise and unreliable estimates.

Analyzing Entry Effects

Large entry effects call into question the validity of results from an experimental evaluation, since the population that applies for the program will not be the same as it would have been in the pre-reform period. Analysis of time series data is the most feasible approach to studying entry effects; as in all nonexperimental approaches, however, results may be sensitive to available data and modeling assumptions. Research on entry effects has been sparse; there is a need for coordinated, cross-state studies to assess the types of changes most likely to produce large entry effects and the most useful modeling approaches.

Treatment of Crossover Cases

Crossover cases should be kept in an impact analysis in their original research groups to avoid sample selection bias and maintain a sample that is representative of the full population of interest. The impact estimates thus generated provide a lower bound on the effect of the program in the absence of crossover. At the same time, evaluators should provide information on the incidence of crossover and how crossover is defined, to allow an assessment of how much estimates may have been distorted.
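The sketch below illustrates this recommendation: the impact is estimated by original research group, with crossover cases left where they were assigned, and the incidence of crossover is reported alongside it. The field names and the data are hypothetical, and crossover is defined here, as an assumption, as any case that received the policy of the other group.

```python
from statistics import mean


def impact_by_original_assignment(cases):
    """Difference in mean outcome by ORIGINAL group, crossovers included."""
    experimental = [c["outcome"] for c in cases
                    if c["assigned"] == "experimental"]
    control = [c["outcome"] for c in cases if c["assigned"] == "control"]
    return mean(experimental) - mean(control)


def crossover_rate(cases):
    """Share of cases whose policy received differs from their assignment."""
    crossed = [c for c in cases if c["assigned"] != c["policy_received"]]
    return len(crossed) / len(cases)


# Hypothetical cases; "outcome" might be monthly earnings in dollars.
cases = [
    {"assigned": "experimental", "policy_received": "experimental", "outcome": 500},
    {"assigned": "experimental", "policy_received": "experimental", "outcome": 450},
    {"assigned": "experimental", "policy_received": "control", "outcome": 400},
    {"assigned": "experimental", "policy_received": "experimental", "outcome": 450},
    {"assigned": "control", "policy_received": "control", "outcome": 400},
    {"assigned": "control", "policy_received": "control", "outcome": 380},
    {"assigned": "control", "policy_received": "experimental", "outcome": 440},
    {"assigned": "control", "policy_received": "control", "outcome": 380},
]

impact = impact_by_original_assignment(cases)
rate = crossover_rate(cases)
print(f"Estimated impact: {impact}")
print(f"Crossover rate: {rate:.0%}")
```

Dropping the two crossover cases, or reclassifying them by the policy they received, would break the comparability that random assignment created between the two groups.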

Estimating Impacts for Subgroups Defined by Events Since Random Assignment

Data for certain types of outcomes may be available only for subgroups of the treatment and control groups that are affected by random assignment. For instance, data on some outcomes (such as children's health) may be collected in a survey in which the response rates differ for the treatment and control groups. Other outcomes are defined by experiences or actions after random assignment--for example, recidivism rates can be computed only for those who leave welfare. Because both of these types of outcomes are available only for nonrandom subsets of the experimental and control groups, the comparability in baseline characteristics between the experimental and control groups may be lost. In defining outcomes for analysis, we recommend that evaluators adopt strategies that take maximum advantage of the strengths of an experimental design. It may be possible to redefine outcomes in ways that use the entire sample. For example, instead of focusing on welfare recidivism rates, the analysis can examine the number (or percentage) of months on welfare since random assignment.
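As a sketch of the redefinition suggested above, the code below computes months on welfare since random assignment for every case in both research groups, so no case is dropped from the comparison. The function names and the six-month receipt histories are hypothetical; each history is assumed to hold one indicator per month, with 1 meaning the case received welfare in that month.

```python
def months_on_welfare(monthly_receipt):
    """Count the months with welfare receipt since random assignment."""
    return sum(1 for received in monthly_receipt if received)


def mean_months_on_welfare(histories):
    """Average months on welfare across ALL cases in a research group."""
    return sum(months_on_welfare(h) for h in histories) / len(histories)


# Hypothetical six-month histories, one list per case.
experimental = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1],   # left welfare and later returned
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],   # never left
]
control = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 1],
]

impact = mean_months_on_welfare(experimental) - mean_months_on_welfare(control)
print(f"Impact on months on welfare: {impact}")
```

A recidivism rate would use only the cases that left welfare, a subset whose makeup may itself differ between the groups. The months-on-welfare measure reflects both exits and returns while keeping the full randomized sample.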