Approaches to Evaluating Welfare Reform: Lessons from Five State Demonstrations

Authors:

Anne Gordon, Jonathan Jacobson, and Thomas Fraker.

Mathematica Policy Research, Inc.

ACKNOWLEDGMENTS

Many individuals contributed to this report. Karin Martinson and Audrey Mirsky-Ashby of the Office of the Assistant Secretary for Planning and Evaluation and Peter Germanis of the Administration for Children and Families (all within the U.S. Department of Health and Human Services) guided the project throughout its course, and provided careful review of the report. John Burghardt of MPR provided internal review. The expert panel convened as part of this project contributed thoughtful and stimulating comments on the issues and influenced many of the report's recommendations. The panel included Daniel Friedlander of the Manpower Demonstration Research Corporation, Robert Moffitt of Johns Hopkins University, Larry Orr of the Brookings Institution and Abt Associates Inc., and Michael Wiseman of the University of Wisconsin-Madison. Patricia Ciaccio edited the report, and it was produced by Lisa Puliti, Debra Jones, Monica Capizzi, and Jill Miller.

We are particularly grateful to the individuals who provided us with information on the five state waiver evaluations reviewed, including both state staff and evaluation contractor staff. The state staff we contacted were Frank Rondas (California Department of Social Services), George Kurian (Colorado Department of Human Services), Robert Lovell (Michigan Department of Social Services), Charles Johnson (Minnesota Department of Human Services), and Jean Sheil (Wisconsin Department of Health and Social Services). The evaluators we spoke with or otherwise communicated with were Barbara Snow of UC DATA, Karen Garrett of the UC-Berkeley Survey Research Center, Walter Furman and Alisa Lewin of the UCLA School of Public Policy, Peggy Cuciti of the University of Colorado at Denver, Alan Werner of Abt Associates Inc., Virginia Knox of MDRC, and Philip Richardson of MAXIMUS. Without their patient assistance, this report would not have been possible.

The authors alone take responsibility for any errors that remain, and for all opinions expressed.

ACRONYMS

AFDC = Aid to Families with Dependent Children

AFDC-UP = Aid to Families with Dependent Children--Unemployed Parent

APDP = Assistance Payments Demonstration Project (California)

CAPI = Computer-assisted personal interviewing

CATI = Computer-assisted telephone interviewing

CPREP = Colorado Personal Responsibility and Employment Program

CPS = Current Population Survey

DHHS = U.S. Department of Health and Human Services

JOBS = Job Opportunities and Basic Skills

MDRC = Manpower Demonstration Research Corporation

MFIP = Minnesota Family Investment Program

MPR = Mathematica Policy Research, Inc.

OBRA = Omnibus Budget Reconciliation Act

OMB = Office of Management and Budget

PPS = Probability proportional to size

RTI = Research Triangle Institute

SFA = State Family Assistance (Michigan)

TSMF = To Strengthen Michigan Families

UI = Unemployment Insurance

UP = Unemployed Parent

USDA = U.S. Department of Agriculture

VISTA = Volunteers in Service to America

WNW = Work Not Welfare (Wisconsin)

WPDP = Work Pays Demonstration Project (California)

Chapter 1: Introduction

The federal welfare reform legislation enacted in August 1996 (the Personal Responsibility and Work Opportunity Reconciliation Act of 1996--P.L. 104-193) eliminates the Aid to Families with Dependent Children (AFDC) entitlement program and replaces it with a block grant to the states. Under the block grant, states will design, implement (by July 1997), and administer their own programs to aid families with dependent children. These programs, however, must satisfy requirements in the new federal law concerning such matters as work requirements for program participants and limits on the amount of time that a family can receive assistance.

Although the former AFDC program was federally administered and imposed requirements on the states, it did offer the states considerable flexibility to deviate from those requirements through a formal process of applying for and receiving federal waivers to implement welfare reform demonstration programs. Provision for waivers of AFDC regulations is included in Section 1115 of the Social Security Act, which was added to the original 1935 act as part of the Social Security Amendments of 1962. Section 17(b) of the Food Stamp Act provides for similar waivers of federal regulations governing the Food Stamp Program. Beginning in 1992, the Bush administration actively encouraged states to initiate reforms to their welfare systems through these waiver provisions. The Clinton administration continued this encouragement and had approved AFDC waivers for 43 states as of August 1996. The new law limits the DHHS waiver authority, but expands Food Stamp waiver authority by giving states more flexibility to experiment with Food Stamp rules under waivers.(1) In addition, the act allows states to keep their waiver programs in place for the duration of the waivers, if the waivers are inconsistent with the requirements of the new law. Because this report is being prepared in September and October 1996, it is not clear how this provision will be interpreted or how many states will keep their waiver programs (and evaluations) in place.

In granting a state's request for waivers of regulations governing the AFDC and Food Stamp programs, the federal government required that the state satisfy explicit terms and conditions. The terms and conditions, which the federal government negotiated with each state, (1) defined the new policies under which the demonstration programs would be run, (2) required the demonstrations to be cost neutral to the federal government, (3) specified requirements for evaluations of the demonstrations, and (4) detailed the cost neutrality methodology and reporting requirements. Accompanying the terms and conditions was a list citing the specific provisions of the Social Security Act and the Food Stamp Act that were being waived and the nature of the waivers. The terms and conditions also required the state to contract with a third party to evaluate the effects of the waivers. In general, the evaluation was to include impact, cost-benefit, and process/implementation analyses.

Several divisions within the U.S. Department of Health and Human Services (DHHS) were responsible for reviewing and approving state Section 1115 waiver applications along with the Office of Management and Budget (OMB).(2) DHHS project officers were responsible for reviewing the evaluation plans that states submitted as part of their waiver application packages and monitoring the conduct of the evaluations. Although they shared certain design features, the evaluations used widely divergent methods to achieve similar research objectives. DHHS determined that it would be useful to obtain a broader perspective, from a source outside of the day-to-day activities of reviewing and monitoring the evaluations, to identify common research issues and to assess the appropriateness of the approaches that the states and the third-party evaluators had adopted. DHHS contracted with Mathematica Policy Research, Inc., (MPR) to conduct this study of evaluations. This report presents our findings and recommendations.

A. OBJECTIVES AND METHODS OF THIS STUDY

In recent years, the federal government has required most evaluations of welfare reform demonstrations to be based on an experimental design.(3) In welfare reform demonstrations, an experimental design means that some welfare cases are randomly assigned to the demonstration program (these are referred to as experimental group cases) and others are randomly assigned to the pre-reform program (these are referred to as control group cases). A comparison of individuals randomly assigned to the experimental and control groups provides an estimate of the impact of the reforms that controls for factors external to the demonstration, such as changes in the economy. In addition to requiring a common design for the demonstration evaluations, the federal government also specified a core set of outcome measures that must be included in the evaluations, including the employment and earnings of welfare recipients and their rates of marriage and separation.
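
To make the experimental estimator concrete, the brief sketch below (our illustration, not part of any waiver evaluation) computes an impact estimate as the difference in mean outcomes between experimental and control cases; the data and the outcome measure (quarterly earnings) are synthetic.

```python
# Illustrative sketch only -- not part of any waiver evaluation.
# Synthetic quarterly earnings for experimental and control cases; the
# impact estimate is simply the difference in group means.
import math
import random
import statistics

random.seed(0)

control = [max(0.0, random.gauss(900, 700)) for _ in range(2000)]
experimental = [max(0.0, random.gauss(960, 700)) for _ in range(2000)]

# Difference in mean outcomes between experimental and control cases.
impact = statistics.mean(experimental) - statistics.mean(control)

# Standard error of the difference in means (unequal-variance formula).
se = math.sqrt(statistics.variance(experimental) / len(experimental)
               + statistics.variance(control) / len(control))

print(f"Estimated impact on quarterly earnings: {impact:.1f} (standard error {se:.1f})")
```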

Because the welfare reform demonstration evaluations generally share a common research design and core set of outcome measures, they repeatedly confront the same research issues.(4) These issues must be appropriately resolved if the evaluations are to produce reliable research results that are useful for assessing and making welfare policy. Given the similarity in many of the programs and in the overall research designs, the diversity of approaches that the evaluators take to specific research issues is somewhat surprising. Some evaluators do not address a particular issue, either because they fail to recognize it as a potential problem or because they recognize it but judge that doing nothing is the most appropriate course of action. Among those who actively respond to an issue, a variety of technical approaches may be used.

DHHS, as it observed the welfare reform demonstration evaluators struggling with a common set of research issues, recognized the need to systematically identify the issues and assess the approaches to them. This led to discussions between DHHS and MPR during fall 1995 that resulted in two conclusions that provided structure for the study of waiver evaluations:

  1. To effectively review the designs for the welfare reform demonstration evaluations and monitor their progress, the federal government sought an assessment of the most important research issues common to the evaluations and the appropriateness of various approaches to them. This same information would also benefit the states and evaluators in designing and carrying out the evaluations.
  2. Without the federal waiver process, more of the initiative for designing, implementing, and evaluating welfare demonstration programs would shift to the states. Many of the research issues surrounding the waiver demonstration evaluations would be present in those future evaluations. Thus, a compilation of research issues and approaches arising in the context of those evaluations would almost certainly be beneficial to future state welfare reform evaluations occurring outside the federal waiver process.

These observations implied two broad objectives for this study. One objective was to identify for the federal government, the states, and the evaluators the principal research issues surrounding the design and execution of the waiver demonstrations and evaluations and to assess the appropriateness of various technical approaches to those issues. The study's other objective was to document those issues and approaches in a sufficiently general way that the information would be useful in designing and conducting welfare demonstrations and evaluations outside of the federal waiver context.

To achieve these objectives, DHHS and MPR agreed that the study would be based on the actual experiences and decisions of states and evaluators conducting waiver demonstrations and evaluations. They also agreed that its focus would be on the estimation of impacts of the demonstrations, rather than the cost neutrality calculations or other federally mandated components of the evaluations. MPR would obtain this information primarily by reviewing documents prepared by the federal government, the states, and the evaluators pertaining to selected waiver demonstrations and evaluations. These documents included the following:

  • The state's waiver request, including its evaluation design
  • The terms and conditions under which DHHS and USDA granted a state waivers to implement its welfare reform demonstration
  • The evaluator's plan for the demonstration evaluation
  • Quarterly and annual reports on the progress of the evaluation
  • Interim evaluation reports, when available (no final reports were available)

When this study was being conducted in 1996, most of the waiver evaluations had not been under way long enough to have produced interim reports; this limited what could be learned by reviewing project documents. As a supplemental source of information on the evaluations, MPR staff members spoke by telephone with the state officials in charge of the selected evaluations, the directors of the evaluation contracts and (in some states) other researchers working on the evaluations.(5) These calls were made to fill in gaps in information obtained from documents and to obtain the latest information about the status, methods, and findings of the evaluations.

An advisory panel of experts on welfare evaluations was a final source of information for this project.(6) The advisory panel reviewed MPR memos identifying the key issues surrounding the evaluations, reviewed MPR summaries of state demonstrations and evaluations, and met with DHHS and MPR staff to comment on these items and to provide additional information about the evaluations and guidance regarding the course of the study.

B. SELECTION OF FIVE WELFARE REFORM EVALUATIONS

Because of budget limitations, the number of welfare reform demonstration evaluations considered in this study was restricted to five. DHHS identified 30 states that were evaluating waiver demonstrations. Several of these states had implemented more than one set of reforms and, consequently, were conducting multiple evaluations. The following criteria guided the selection of five evaluations to study:

  1. Degree of completion of the evaluation
  2. Anticipated level of cooperation by the state and the evaluator
  3. Relevance of the intervention to broader welfare policy, in the sense of including features common in many waiver demonstrations and likely to continue in a block grant environment(7)
  4. Design for the evaluation: experimental or quasi-experimental (comparison group)
  5. Complexity of the intervention (ranging from a very focused change to comprehensive welfare reform)
  6. Type of evaluator (for example, university, policy research firm)

We sought evaluations that demonstrated strength in criteria 1 through 3 and diversity in criteria 4 through 6. In making our selections, we also sought to achieve geographic diversity and gave preference to states with larger populations or welfare caseloads; however, we were more willing to compromise on the latter criteria than on the six listed here.

On the basis of these criteria, we selected the following five states and associated waiver demonstrations and evaluations:

  1. California: (1) the Assistance Payments Demonstration Project, and (2) the Work Pays Demonstration Project
  2. Colorado: the Colorado Personal Responsibility and Employment Program
  3. Michigan: To Strengthen Michigan Families
  4. Minnesota: Minnesota Family Investment Program
  5. Wisconsin: Work Not Welfare

C. OVERVIEW OF THE SELECTED EVALUATIONS

Appendix A provides detailed descriptions of the five welfare reform demonstration evaluations that provided the information base for this study. This section summarizes that material.(8)

1. California

California's Assistance Payments Demonstration Project (APDP) and Work Pays Demonstration Project (WPDP) were implemented statewide in December 1992 and March 1994, respectively. Four of the state's 58 counties are serving as research counties, where ongoing welfare cases and newly approved applicants for assistance are randomly assigned to experimental or control groups. APDP/WPDP combine cuts in the maximum cash grant (the amount paid to families with no other cash income) with incentives for recipients to obtain jobs and increase their hours of market work. The incentives, which are common features of welfare reform demonstrations in many states, include removing the time limits on the AFDC $30 and one-third earnings disregards, raising the limit on the assets that a welfare recipient or qualified applicant may own, and eliminating the 100-hour limit on the number of hours per month that the principal earner in a two-parent family may work while the family qualifies for cash assistance (the "100-hour rule").(9) APDP reduces the maximum cash grant by about 11 percent.

To comply with the standard federal terms and conditions for welfare reform waiver demonstrations, California is conducting a random-assignment evaluation of APDP/WPDP. Although California implemented its waivers in phases beginning in December 1992, a single research sample (including both newly approved applicants and recipients) is being used to evaluate the waivers. The subsample of experimental cases is subject to all of the waivers as they are phased in; the subsample of control cases is subject to none of them. Because California's AFDC and Food Stamp programs are administered by the counties, there is no statewide welfare data system, which has presented a major challenge to the evaluation. The state Medicaid data system has been used extensively, and procedures for extracting standardized data from the county data systems have been developed. The Center for Child and Family Policy Studies at UCLA and UC DATA, a research unit at UC Berkeley, are conducting the APDP/WPDP evaluation.

2. Colorado

Unlike many of the welfare reform waiver demonstrations, the Colorado Personal Responsibility and Employment Program (CPREP) has not been implemented statewide; instead, it has been operating in 5 of Colorado's 63 counties since June 1994. CPREP is designed to encourage ongoing and new cases who do not have earnings to obtain jobs. It does this primarily by altering the financial incentives embedded in the AFDC eligibility and benefit formulas. In particular, CPREP expands earned-income disregards, eliminates time limits on those disregards, and raises the eligibility limits on income and asset holdings. In addition, AFDC and child care benefits are combined in a single check and food stamp benefits are cashed out in a second check. Changes in the Job Opportunities and Basic Skills (JOBS) training program are relatively minor.(10)

The CPREP evaluation is being conducted in the five demonstration counties, based on an experimental design. Random assignment of applicant cases to experimental or control status occurs after the determination of their eligibility and benefit levels according to traditional AFDC program (control group) rules, although these rules impose tighter asset limits than under welfare reform. This timing of random assignment minimizes the burden on income maintenance workers. It should not introduce significant bias into the evaluation findings, because most of the waivers (other than those concerning assets) involve the treatment of earned income. CPREP is restricted to cases without initial earnings; therefore, the changes in the treatment of earned income are irrelevant for these cases at the time of application. The evaluation is being based on data from administrative files plus three waves of client survey data. A research unit within the University of Colorado at Denver is evaluating CPREP.

3. Michigan

Michigan's welfare reform demonstration, To Strengthen Michigan Families (TSMF), was implemented statewide in October 1992. Significant amendments to the demonstration were implemented two years later. TSMF combines many of the financial incentives of the California and Colorado demonstrations (for example, expansion of earned-income disregards, raising of limits on asset holdings, and elimination of the 100-hour rule for two-parent families) with a stiffening of requirements for participation in the JOBS program. The provisions added in October 1994 include elimination of most JOBS exemptions and an increase in the severity of sanctions for noncompliance with JOBS requirements. Another provision of the 1994 TSMF amendments requires that preschool children be immunized, with a penalty for noncompliance. (This provision is common to the waiver demonstrations in many states.)

A random-assignment evaluation of TSMF is being conducted in four welfare offices located in 3 of Michigan's 83 counties. The research sample includes both ongoing recipients and new applicants for assistance. Unlike the California APDP/WPDP and Colorado CPREP evaluations (but like many other waiver demonstration evaluations), applicant cases in the TSMF evaluation undergo random assignment before determination of eligibility. This permits eligibility to be determined according to the provisions specific to an applicant's treatment or control status. Denied applicants who did not qualify for the alternative State Family Assistance (SFA) program were excluded from the research sample because of data limitations.(11) The TSMF evaluation is based on data from administrative files; a single client survey is also planned. The relatively early implementation date of TSMF means that this evaluation has generated more results than the other four evaluations that make up the information base for this report. Abt Associates, Inc. is the TSMF evaluator.

4. Minnesota

The Minnesota Family Investment Program (MFIP) was implemented in April 1994 in 7 of the state's 87 counties. The same seven counties are serving as research sites for the evaluation. MFIP consists of three major elements: (1) consolidation of AFDC, Food Stamps, and Family General Assistance into a single benefit check with one set of rules governing eligibility and benefits; (2) expansion of financial incentives to work; and (3) stiffening of requirements for participation in the JOBS program. The work incentives are those found in many state waiver demonstrations: expansion of earned-income disregards, an increase in the limit on asset holdings, and elimination of the 100-hour rule for two-parent families. The new JOBS requirements include fewer JOBS exemptions and a 10 percent reduction in benefits for failure to participate in JOBS when participation is mandatory.

The MFIP evaluation has an experimental design with a unique feature that allows the impacts of distinct subsets of the waiver provisions to be estimated separately. (This contrasts with the typical experimental design for waiver demonstration evaluations, which only supports estimation of the overall impact of the full set of waiver provisions.) The unique feature of the MFIP experimental design is the random assignment of applicant and ongoing cases to four experimental/control groups. The four groups are (1) a full experimental group subject to all MFIP reforms (E1), (2) a partial experimental group subject to all MFIP reforms except those pertaining to the JOBS program (E2), (3) a traditional control group subject to the pre-reform AFDC program and the pre-reform JOBS program (C1), and (4) a second control group subject to the pre-reform AFDC program but with no access to the JOBS program (C2). By comparing outcomes for different pairs of groups, impacts of different program combinations may be estimated. These estimates will be based on administrative data, baseline data forms, and two rounds of follow-up surveys. The baseline data forms were completed by applicants/ongoing cases and program staff just before random assignment. The Manpower Demonstration Research Corporation (MDRC) is evaluating MFIP; Research Triangle Institute (RTI) is conducting the follow-up surveys.
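
The brief sketch below, our own illustration rather than MDRC's analysis plan, shows how pairwise contrasts among the four groups isolate different bundles of reforms; the group mean earnings are invented for the example.

```python
# A sketch of how pairwise contrasts among the four MFIP research groups
# isolate different bundles of reforms. This is our illustration, not MDRC's
# analysis plan; the group mean earnings below are invented.
group_mean_earnings = {"E1": 1050.0, "E2": 1010.0, "C1": 950.0, "C2": 930.0}

contrasts = {
    "full MFIP package (E1 - C1)":
        group_mean_earnings["E1"] - group_mean_earnings["C1"],
    "MFIP without the JOBS changes (E2 - C1)":
        group_mean_earnings["E2"] - group_mean_earnings["C1"],
    "incremental effect of the MFIP JOBS changes (E1 - E2)":
        group_mean_earnings["E1"] - group_mean_earnings["E2"],
    "access to the pre-reform JOBS program (C1 - C2)":
        group_mean_earnings["C1"] - group_mean_earnings["C2"],
}

for label, difference in contrasts.items():
    print(f"{label}: {difference:+.1f}")
```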

5. Wisconsin

Wisconsin has implemented a number of welfare demonstration programs, authorized by distinct sets of waiver packages. Some of these demonstrations are operating statewide, while others have been implemented only in selected sites. Work Not Welfare (WNW), one of the latter type of demonstrations, has been operating in 2 of the state's 72 counties since January 1995. WNW is meant to radically alter the culture of welfare. Its intent is to affect the administration of welfare and how current and potential recipients and the community at large perceive it. The program seeks the active involvement of local businesses in creating jobs and of community organizations in supporting children and families. It imposes a two-year time limit on the receipt of cash assistance during any four-year period; however, exemptions to this requirement are granted. The program places a strong emphasis on encouraging applicants to enroll only in Food Stamps and Medicaid, thus avoiding the start of the two-year clock limiting the receipt of cash assistance. Those who receive cash grants must participate for the requisite number of hours per week in employment or in approved education or training programs, or have their grants reduced in proportion to the shortfall in hours.

DHHS and the state of Wisconsin agreed that an experimental evaluation of WNW was inconsistent with the demonstration's emphasis on communitywide changes in the culture of welfare; it was feared that random assignment would dilute the cultural change sought. Consequently, Wisconsin received approval from the federal government for a quasi-experimental "matched comparison county" design for the WNW evaluation. Each of the two demonstration counties was matched with two comparison counties (for a total of four comparison counties) that have similar economic, demographic, and welfare caseload characteristics but in which WNW is not operating. Impacts on applications, enrollments, caseloads, and other outcomes will be estimated on the basis of aggregate time series data and case-level longitudinal data for the demonstration counties and matched comparison counties. MAXIMUS, a policy research firm, is conducting the WNW evaluation.

Notes

(1)A Medicaid waiver process also exists, and this is administered by the Health Care Financing Administration. In some instances Medicaid waivers were part of a comprehensive welfare reform package. In other cases, the Medicaid program has granted waivers for comprehensive Medicaid demonstrations. The new law does not eliminate or restrict the Medicaid waiver process.

(2)Because states typically submitted requests for Food Stamp waivers and AFDC waivers as part of an integrated package, DHHS worked closely with the U.S. Department of Agriculture (USDA) to review and process those requests. The evaluations of these waivers also tended to be highly integrated, with DHHS and USDA again cooperatively monitoring the evaluations.

(3)Here and in the rest of this report (unless otherwise noted), federal government refers to DHHS, OMB, and USDA, and demonstrations refer to AFDC, Medicaid, and Food Stamp reform demonstrations approved under Section 1115 of the Social Security Act and Section 17(b) of the Food Stamp Act.

(4)This report uses the present tense to describe the welfare reform demonstration evaluations because, despite the passage and signing of the new welfare law, most of the evaluations of the previously approved waiver demonstrations were ongoing at the time of the writing of this report in September and October 1996.

(5)These data were collected in the first half of 1996, and reflect the status of the evaluations through June 1996.

(6)The advisory panel members were Daniel Friedlander of the Manpower Demonstration Research Corporation, Robert Moffitt of Johns Hopkins University, Larry Orr of Abt Associates Inc. (on leave at the Office of the Assistant Secretary for Planning and Evaluation, DHHS), and Michael Wiseman of the University of Wisconsin.

(7)This objective was somewhat constrained by Objective 1, however. For instance, only one program testing time-limited welfare was included.

(8)Appendix B describes the material reviewed concerning each evaluation.

(9)The AFDC disregard of the first $90 in earnings per month (to offset work expenses) was not time limited. However, there were time limits of 12 months on the disregard of the next $30 of earnings and 4 months on the disregard of one-third of earnings in excess of $120.

(10)AFDC rules required all nonexempt adult AFDC recipients to participate in the state-administered JOBS employment and training program.

(11)Since most of the TSMF applicants who would be denied AFDC under control group rules but not under reform rules qualified for SFA (and therefore were included in the research sample), this exclusion is not expected to create serious problems for the evaluation.

Chapter 2: Objectives and Methods of the Welfare Reform Waiver Evaluations

This chapter reviews the objectives of the welfare reform waiver evaluations. It then identifies alternative potential designs for the welfare reform impact evaluations and assesses their strengths and limitations.(1)

A. OBJECTIVES OF THE EVALUATIONS

The terms and conditions under which the federal government granted waivers to the states to implement welfare reform demonstration programs included specifications for evaluating the demonstrations. These specifications encompassed the basic design for the evaluations, data collection activities, outcome measures, and types of analyses. From these specifications, we can infer that the federal objectives for the evaluations were to answer the following questions:

  • What was the process by which the demonstration program was designed and implemented? What problems were encountered during implementation? How did the demonstration as implemented differ from (1) the pre-reform program and (2) the demonstration as planned? The process analysis was intended to answer these questions.
  • Did the demonstration satisfy the requirement of being cost neutral to the federal government? If not, what was the level of additional costs incurred or savings achieved? The cost neutrality analysis was intended to answer this question.
  • What were the impacts of the reform program, relative to the pre-reform program, on a wide range of outcomes? The outcomes states were required to examine generally included:

    - Participation in the AFDC and Food Stamp programs and associated benefit levels
    - Employment and earnings
    - Participation in the JOBS program
    - Family structure and stability
    - Child well-being

Other outcomes, such as child school attendance and child inoculations, also were specified in the terms and conditions if specific provisions of a demonstration program were designed to influence those outcomes. The impact analysis was intended to generate estimates of the impacts of the demonstration on these outcomes.

  • Did the benefits derived from the reforms exceed the costs, as assessed from the perspectives of the program participants, various levels of government, and society as a whole? The cost-benefit analysis was intended to answer this question.

The terms and conditions specified requirements for the impact analysis that supplemented the core objectives noted here. For example, they required impact estimates for subgroups of the AFDC population. At a minimum, this included separate estimates of impacts by a case's AFDC applicant/recipient status in the month of random assignment and the characteristics of age and race. The terms and conditions posed two additional objectives for the impact analysis as feasibility assessments rather than as required analyses: (1) to determine the effects of the demonstration on entry into AFDC (that is, effects on applications, approved applications, and caseloads), and (2) to estimate impacts of discrete components of the overall waiver package. Most of the designs for the evaluations were not conducive to the estimation of either entry effects or the impacts of separate components of a waiver package.

Although states undertook the waiver evaluations in response to the requirement to do so in the terms and conditions, they also had their own objectives for the evaluations. For example, a state may have had an especially strong interest in one or more outcome measures not emphasized (or not even mentioned) in the terms and conditions. States seeking to fine-tune their programs may have given great importance to estimating the impacts of discrete components of a waiver package. Alternatively, a state may have been most concerned about obtaining process information on program implementation and client experiences, with much less interest in the impact evaluation. States may also have had objectives for the evaluations that could best be addressed through analyses other than the four types discussed earlier. For example, some states sought frequent feedback from welfare participants on their perceptions of welfare reform and how well it was meeting their needs. Responding to these objectives in some instances required periodic customer satisfaction surveys or focus group discussions with welfare clients.

When a welfare reform evaluation has a large number of objectives, it may have difficulty addressing all of them well. Efforts to do so may lead to design changes or shifts in resources that result in less reliable estimates of central outcomes such as employment and welfare participation. One example from the waiver context is the design and fielding of client surveys on a wide array of topics after it was already too late to collect, at sample intake, the contact information needed to ensure a high response rate. Such dilution of effort, and the resulting reduction in the quality of the research, can be avoided if all of the organizations involved work together from the start to set clear priorities. The priorities should reflect the policy importance of outcomes and the accuracy with which they can be measured.

B. ALTERNATIVE DESIGNS FOR IMPACT EVALUATIONS

With only a few exceptions, the terms and conditions for welfare reform demonstrations in the 1990s have required evaluations based on an experimental design. (The most notable exception, the evaluation of Wisconsin's WNW demonstration, is discussed later in this chapter and elsewhere in this report.) To assess the advantages and limitations of an experimental design, it is helpful to identify the key features of this design and several nonexperimental designs:

  • Experimental Design. Within selected research sites in which the reform program and the pre-reform program are operating side by side, target cases are randomly assigned to experimental status (the reform program) or control status (the pre-reform program). For the welfare reform demonstration evaluations, the target cases are ongoing welfare recipients and new applicants for assistance. Random assignment ensures that the experimental and control cases are alike, on average, in all respects except for the welfare program rules that they face. Thus, differences in average outcomes for the two groups can be attributed to the reform program. This type of design is said to have a high level of internal validity.
  • Self-Selected Comparison Group Design. This design typically is used to compare two versions of a program operating side by side, when target cases are permitted to choose which program to apply for or participate in (for example, two types of training programs). It may also be used to examine the effects of a program (versus nonparticipation) when it is deemed morally or practically impossible to limit who participates. A multivariate statistical model can be used, in principle, to control for differences in characteristics between cases that select the program of interest and cases that select the alternative program, thus isolating impacts due to differences between the two programs. In practice, such a model has two important limitations: (1) the impact estimates may be sensitive to the exact specification of a statistical model (typically, little guidance is available regarding certain aspects of model specification)(2); and (2) the statistical model may do a poor job of controlling for differences in difficult-to-measure characteristics of cases or individuals, such as self-esteem and ambition, that affect the program they choose, and this limitation (known as selection bias) may bias estimates of impacts. (A simple illustration of this selection problem appears after this list.)
  • Quasi-Experimental Design. A quasi-experimental design entails the selection of one or more groups of program applicants or participants to receive the reform program (the demonstration groups) and other groups to receive the pre-reform program (the comparison groups). The demonstration and comparison groups are separated in space or time but are matched on the basis of aggregate characteristics that are believed to influence the outcomes of interest. Despite the best attempts to match demonstration and comparison groups, important differences often exist, and these may be a source of bias in impact estimates. Statistical models have the potential to control for such differences, if the differences can be measured at the individual level, but they have the same limitations in a quasi-experimental design as in a self-selected comparison group design. (The advantage of the quasi-experimental approach is that individual-level differences should be smaller.) However, the key disadvantage of the quasi-experimental design is that any site-level (or time-period-specific) differences that affect outcomes may be confounded with the effect of the program. (Examples could include differences in economic climate or program administration.) The problem in this instance is not that the differences are difficult to observe, but that they only vary across sites (or time periods), and the number of sites (periods) is generally too small to allow all of these factors to be controlled for. Additional information on quasi-experimental designs is provided in Section B.3.
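
The selection problem noted in the second design above can be illustrated with a small synthetic example. The sketch below is ours, not any evaluator's model; the unobserved trait labeled "motivation" is hypothetical and is omitted from the regression, so the adjusted estimate remains biased.

```python
# Synthetic illustration of selection bias in a self-selected comparison
# group design. "motivation" stands in for a hard-to-measure trait that
# drives both program choice and earnings; because it is left out of the
# regression, the adjusted estimate is still biased. All values are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 5000

motivation = rng.normal(size=n)            # unobserved trait
observed_x = rng.normal(size=n)            # observed case characteristic
# More-motivated cases are more likely to choose the reform program.
in_reform = (motivation + rng.normal(size=n) > 0).astype(float)

true_impact = 50.0
earnings = (900 + true_impact * in_reform + 200 * motivation
            + 100 * observed_x + rng.normal(scale=300, size=n))

# OLS of earnings on the program indicator and the observed covariate only.
X = np.column_stack([np.ones(n), in_reform, observed_x])
coef, *_ = np.linalg.lstsq(X, earnings, rcond=None)

print(f"true impact: {true_impact:.1f}")
print(f"regression estimate (motivation omitted): {coef[1]:.1f}")  # overstated
```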

The first major application of an experimental design in social welfare policy research was the set of negative income tax experiments of the late 1960s and early 1970s (Burtless and Hausman 1978; and Keeley et al. 1978). Since that time, there have been many social welfare policy evaluations based on experimental designs (Greenberg and Shroder 1991). The number and diversity of these evaluations have been increasing in recent years. Methodological studies have used data from several of these evaluations to determine whether nonexperimental evaluation methods could yield impact estimates similar in sign and magnitude to those generated by experimental methods (LaLonde 1986; Fraker and Maynard 1987; and Heckman and Hotz 1989). The interpretation of the findings from these studies remains controversial (Heckman and Smith 1995). The most common conclusion, however, is that nonexperimental estimators frequently provide different results than would be found in an experimental evaluation, and are therefore biased. Furthermore, the nonexperimental results are sensitive to minor changes in model specification. Thus, experimental estimators are preferred (Burtless 1995; and Friedlander and Robins 1995). DHHS shares this conclusion, as shown by the strong preference it exhibited for experimental evaluations of the welfare reform waiver demonstrations. In special circumstances, however, it approved alternative designs for evaluations of waiver demonstrations.

Despite the methodological strength of an experimental design, the difficulty of implementing such a design sometimes may limit its usefulness. In addition, there may be nontechnical reasons for preferring an alternative design (such as considerations of cost or fairness). The next two subsections consider the advantages and limitations of an experimental design, with particular emphasis on the needs of the impact analysis component of an evaluation. The third and final subsection defines various permutations of a quasi-experimental design and discusses when such a design might be desirable.

1. Advantages of an Experimental Design

The principal advantage of a well-planned and well-executed experimental design is that it ensures that, in all respects other than receipt of the treatment, experimental and control cases are alike. The difference in average outcomes between the experimental and control groups is thus an unbiased estimate of the average impact of the program; this is known as internal validity. This eliminates the need to rely on a multivariate statistical model to control for case characteristics.(3) Consequently, the estimation of impacts in an experimental design is straightforward. The central feature of an experimental design, random assignment, does what a multivariate model attempts to do in a nonexperimental design: it controls for differences in characteristics between cases that receive the reform program and those that do not. Random assignment imposes this control more effectively, however, essentially eliminating any possibility of bias from imperfectly controlling for background characteristics.
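
The following sketch, built on synthetic data and a hypothetical baseline covariate (prior earnings), illustrates this point and the related point in footnote 3: regression adjustment is not needed for unbiasedness under random assignment but can improve precision.

```python
# Synthetic illustration of footnote 3: with random assignment, the simple
# difference in means is unbiased, and adding a baseline covariate (here,
# hypothetical prior earnings) mainly tightens the standard error.
import numpy as np

rng = np.random.default_rng(1)
n = 4000

prior_earnings = rng.gamma(shape=2.0, scale=400.0, size=n)   # baseline covariate
treat = rng.integers(0, 2, size=n).astype(float)             # random assignment
earnings = (300 + 0.8 * prior_earnings + 60.0 * treat
            + rng.normal(scale=400, size=n))

def ols_with_se(X, y):
    """Return OLS coefficients and conventional standard errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef, np.sqrt(np.diag(cov))

for label, X in [("difference in means", np.column_stack([np.ones(n), treat])),
                 ("covariate-adjusted", np.column_stack([np.ones(n), treat, prior_earnings]))]:
    coef, se = ols_with_se(X, earnings)
    print(f"{label}: impact {coef[1]:.1f} (standard error {se[1]:.1f})")
```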

Another important advantage of an experimental design is that all cases--those receiving the reform program and those not receiving it--coexist in the same site or sites during the same time period. They are therefore exposed to the same economic and other factors that may influence outcome measures independently of the program reforms. This strategy avoids a principal limitation of a quasi-experimental design, which is that cases in one group may be exposed to plant closings, migration, floods, and other economic, social, and natural phenomena that cases in the other group are not exposed to.

2. Limitations of an Experimental Design

The advantages of the experimental design discussed earlier are compelling. When the design is implemented carefully, most policy researchers see these advantages as eclipsing the limitations discussed next. However, there may be particular applications in which one or more of these limitations looms large--perhaps because of a strong policy need for information on a specific type of outcome that an experimental design is not well suited to provide.

An experimental design can be costly and challenging to implement. Program staff members sometimes are reluctant to implement random assignment; substantial training may be necessary to convince them that it is worth doing and doing right. Alternatively, it may be necessary to contract out certain aspects of random assignment. Either approach can be expensive. Program staff members also must be trained to operate the reform and pre-reform programs side by side in the research sites. Both random assignment and the operation of two programs simultaneously require additional managerial resources.

Two limitations are associated with the challenge of successfully implementing an experimental design. First, because an experimental design often is difficult and costly to implement, state administrators generally select only a subset of counties (or other administrative units) to implement random assignment. They may be inclined to choose only those sites that they believe will be successful both in implementing random assignment and in operating the reform and pre-reform programs concurrently. Selection of any small group of sites--particularly those more likely to be successful--means that the research sample of experimental and control cases is unlikely to be representative of the statewide welfare caseload (the broader population of interest). Consequently, findings from experimental evaluations frequently lack external validity, meaning that users of the research cannot generalize from the findings for the research sample to the full (state) population. With alternative designs that are easier to implement, state-level administrators may be more willing to select research counties randomly or to allow all counties to be research counties. Either approach may yield findings with a high degree of external validity.

A second limitation associated with the difficulty of implementing an experimental design is that it may be difficult to maintain pure versions of the reform and pre-reform programs for the experimental and control groups. Participants in the pre-reform program may receive elements of the reform program, or vice versa. For example, program staff could have difficulty keeping the rules of the two programs separate, or participants in one program could be exposed to advertising or news accounts of the other program and mistakenly assume that the rules governing the other program apply to them. Any such mixing of elements from the two programs would tend to bias impact estimates toward showing no impact of the reform program. In addition, cases in the experimental and control groups could be exposed to the other program if they migrate to a nonresearch site that is operating the other program or if they split into two cases or merge with a case that has a different research status.

Unless specifically designed to do so, an experimental design does not provide a strong basis for estimating the impacts of individual components or sets of components within a package of reforms.(4) To allow estimation of component impacts, a design must include random assignment of cases to multiple experimental groups. The number of such groups increases as the number of program components with impacts to be estimated increases. The number of different programs that must be operated also increases. Few states are willing to take on such an administrative burden. It can be done, however, as shown by the MFIP demonstration, in which a four-group experimental design is being used to estimate the overall impacts of the demonstration as well as the separate impacts of two distinct sets of reforms.

Some welfare reforms may be designed to discourage families from applying for welfare or from entering welfare if they are eligible; others may actually encourage applications (for example, among two-parent families). An experimental design will not support the estimation of such entry effects because they occur prior to application and thus prior to random assignment. Furthermore, although an experimental design will still give unbiased impacts for those who apply for welfare after welfare reform has been implemented, substantial entry effects may imply that these estimates are not applicable to the population that would have applied under the old program. A nonexperimental study of entry effects that examines application behavior over time is vulnerable to differences between reform and pre-reform groups that are not related to the demonstration; however, no practical experimental alternatives are available.(5)

Similarly, if an intervention is designed to have substantial community effects (that is, to change the culture and mores of an entire community), it may be necessary to implement the new program on a saturation basis in selected sites, and this precludes the use of an experimental design. The federal government approved the use of a quasi-experimental design to evaluate Wisconsin's WNW demonstration, largely because this demonstration was designed to have substantial community effects. There was also concern that the program had been designed to reduce caseloads by discouraging entry into cash assistance. The following subsection provides additional information on quasi-experimental designs and the application of such a design in the context of WNW.

3. Quasi-Experimental Designs

In some circumstances, states may wish to pursue quasi-experimental designs for evaluating welfare reform programs. Motivations for pursuing these designs include the following:

  • The state may not wish to invest the resources needed to operate two programs and monitor random assignment.
  • The state may be reluctant to use an experimental design for political or ethical reasons.
  • The state may believe the welfare reform program must apply to all cases in a local area to be effective, because the program is intended to have substantial community effects, or because high levels of publicity necessarily imply that a control group would be affected.
  • The state may believe that large entry effects will result from the program intervention, thus calling into question the usefulness of random assignment. (Even if random assignment occurs at the point of application, it cannot capture entry effects, which occur before an application is made.) However, a nonexperimental analysis of entry effects may be coupled with an experimental design as well as a quasi-experimental design.

The rest of this subsection considers criteria for a strong quasi-experimental design and reviews the limitations of this design, even in the best of circumstances.

A quasi-experimental design uses a comparison group separated from the experimental group in time or space. The comparison group consists of a set of cases that have not been given the opportunity to participate in the reform program. Possible configurations include:

  • Pre-Post Design. This design uses as a comparison group a set of cases in the same site as the new program (which could be the entire state), but from a period before the reforms were implemented. The analysis may be conducted at the case level or may use data aggregated by county or other geographic region. The problem with the pre-post design lies in distinguishing the effects of the intervention from the effects of any other factors that change at the same time, such as unemployment rates, demographic characteristics of the low-income population, or changes in related programs. The more periods of pre-program and post-program data that are available, the more potential there is to distinguish the effects of welfare reform from other changes. A major advantage of this type of quasi-experimental design is that it is inexpensive to implement if the data are available. However, it does require that the state maintain longitudinal data on welfare cases on a regular basis.

  • Matched Comparison Site Design. The preferred method for implementing a matched comparison site design has two steps. The first step is to choose pairs of sites suitable for implementing the demonstration program, matched as closely as possible in terms of demographic and economic characteristics and characteristics of the program (other than the reforms being tested). The next step is to randomly pick one member of each pair to be a demonstration site and one member to be a comparison site.(6) If, instead, demonstration sites are selected first from among those willing to implement the demonstration, and comparison sites are then chosen as the best matches from among those not willing, the design is weaker, since the demonstration's success may be correlated with administrators' interest in being a demonstration site. Even with random selection among matched pairs, the small number of sites involved in most demonstrations implies that impact estimates may be biased if there are site differences not captured by the matching criteria, or if events (such as plant closings or openings or natural disasters) occur that lead to major changes at one of the sites in a pair.

  • Combination Pre-Post/Matched Comparison Site Design. The strongest quasi-experimental design is a combination of the pre-post and comparison site designs. This involves a comparison site design, with pre-reform samples from both the demonstration and comparison sites. In such a design, the impact of the program is measured as a "difference in differences"--the difference in outcomes before and after welfare reform in the demonstration sites is contrasted with the difference in outcomes over the same time period in the comparison sites. This approach "nets out" differences between the sites that are constant over time, by comparing changes rather than levels. However, differences between the sites that change over time may still be confounded with the effects of reform. For instance, a plant closing in a comparison site after program implementation may destroy the initial similarity between the two sites in a pair and, thus, lead to biased impact estimates. (A simple numerical illustration of the difference-in-differences calculation follows this list.)
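
The arithmetic of the combined design can be illustrated with invented numbers. The sketch below uses made-up county-level caseload rates, not data from the WNW evaluation.

```python
# Difference-in-differences arithmetic with invented county-level caseload
# rates (cases per 1,000 residents). Neither the numbers nor the labels come
# from the WNW evaluation; they simply show how the calculation works.
means = {
    ("demonstration", "pre"): 42.0,
    ("demonstration", "post"): 33.0,
    ("comparison", "pre"): 40.0,
    ("comparison", "post"): 37.0,
}

change_in_demonstration = means[("demonstration", "post")] - means[("demonstration", "pre")]
change_in_comparison = means[("comparison", "post")] - means[("comparison", "pre")]

# Netting out site differences that are constant over time.
impact = change_in_demonstration - change_in_comparison

print(f"Change in demonstration counties: {change_in_demonstration:+.1f}")
print(f"Change in comparison counties:    {change_in_comparison:+.1f}")
print(f"Difference-in-differences impact estimate: {impact:+.1f} cases per 1,000")
```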

The WNW evaluation is based on a combined pre-post and matched comparison site design. A "difference in differences" analysis will be used for public assistance-related outcomes, for which data from five years before the implementation of the demonstration are available for all Wisconsin counties. These analyses will be conducted at both an aggregate level (with the county-month as the unit of analysis) and a disaggregate level (with the case-spell as the unit of analysis). Cross-site comparisons will be conducted of outcomes for which no pre-implementation data are available, such as employment.

The WNW demonstration sites were selected before the comparison sites, and they were selected from sites with a particular interest in implementing the WNW model. There are only two demonstration counties; both are small and relatively prosperous. For each demonstration county, MAXIMUS (the evaluation contractor) selected two nonadjacent comparison counties that are similar in characteristics such as urbanicity, population, and caseload size. It will use multivariate statistical models with case-level control variables to attempt to control for remaining differences. It is unlikely, however, that matched comparison counties and statistical models will adequately control for the fact that the demonstration counties were preselected. It may not be possible to separate the effects of the program from the effects of being in a county where program staff and administrators were highly motivated to put clients to work.

Notes

(1)Appendix C provides a glossary of evaluation terms used in this report.

(2)The specification of a statistical model refers to (1) the number, type, and measurement of control variables, (2) the measurement of policy variables (for example, participation in the reform program can be measured as a dichotomous yes/no variable or as a continuous months-of-exposure variable), (3) the measurement of outcome variables (for example, current employment can be measured as a dichotomous yes/no variable or as a continuous hours-per-week variable), (4) functional form (that is, whether the relationship between independent and dependent variables is linear or some nonlinear function), and (5) the assumed distribution of the error term. Misspecification of a model along these or other dimensions can result in biased estimates of the impacts of a program reform.

(3)With an experimental design it still may be useful to use a multivariate model to increase the precision of impact estimates.

(4)Nonexperimental methods for estimating the impacts of specific reforms may be employed in an experimental context just as in a nonexperimental context. However, as discussed in Chapter VI, such approaches rarely work well.

(5)An experiment would have to randomly assign all members of the population who could conceivably be at risk of entering the program. Alternately, as discussed in Chapter VI, an experiment could randomly assign a large number of sites, and compare entry rates in experimental and control sites.

(6)Although there is random assignment of sites in this type of design, we do not consider this a truly experimental design, because the sample of sites is typically much too small to rule out the confounding of site-specific factors with the effects of the program.

Chapter 3: Sample Design

The sample design is a critical aspect of the design of the welfare reform waiver evaluations. Sample design includes (1) decisions concerning the overall sample size, (2) allocation of the sample between experimental (or demonstration) cases and control (or comparison) cases, (3) decisions concerning whether to oversample key subgroups (and sample size goals for those groups), and (4) decisions about selecting sites (including the number of sites and the method of selecting them). Key sample design decisions in the welfare reform waiver evaluations usually have been made by state policy and evaluation staff, often under considerable time pressure. Political and administrative considerations have affected decisions concerning the number of sites for evaluation, the specific sites chosen, and the level of resources committed to the evaluation (which limits sample sizes). The federal government has played a disciplining role in sample design by requiring a design that could address federal cost neutrality and by setting minimum standards for sample sizes. Federal staff members also have provided technical review of state designs and advice to the states. Often, evaluation contractors have not been involved in the sample design; they have been involved only after the sample design has been implemented. Two of the five evaluations reviewed here, however, involved evaluators to some extent in the sample design.

This chapter outlines the issues that must be confronted in developing a good sample design, to help those planning future welfare reform evaluations be better informed in making these decisions. The issues we focus on are:

  • Adequacy of sample size overall and for key subgroups
  • Roles of the recipient and applicant samples, as well as implications for the relative sample sizes in these two groups and the design of applicant sampling
  • Importance of generalizability or external validity of the results from the evaluation, as well as the implications for site selection

For each of these topics, we outline the key issues that need to be confronted in designing the sample, describe the choices made in the five state evaluations we reviewed, and present recommendations.

A. ADEQUACY OF SAMPLE SIZE

In evaluating welfare reform, it is important to have adequate samples to learn about the effectiveness of the program. The larger the sample, the more precisely the impacts of the program can be estimated. The larger the sample, however, the more costly the implementation of the welfare reform demonstration. In random-assignment evaluations, the major cost to most states is administering two sets of policies for research; at least some administrative costs (such as training staff members to handle both policies and monitoring random assignment) will increase with sample size. Federal officials, in preparing the waiver terms and conditions, have specified minimum sample sizes, with some variation according to the particular needs and objectives of the evaluation. States have been encouraged to exceed the minimum if possible.

1. Issues

Each evaluation must address several issues concerning what is an adequate sample size:

  • Goals of the sample design, particularly the key outcomes to be measured
  • Minimum precision standard for each goal
  • Balance of the sample between the experimental or demonstration group and the control or comparison group
  • Relative emphasis on overall impacts versus subgroup impacts
  • Implications for sample size of a nonexperimental versus an experimental design

a. Outcomes to Be Measured

In Chapter II, we discussed the importance of narrowing or prioritizing the list of research questions that an evaluation is intended to answer. This is particularly important in sample design, since a sample that is designed to provide precise estimates of one outcome may be very weak for other outcomes. To build a sample that can answer the key research questions, it is important to determine the key outcome (or, at most, a handful of key outcomes) the evaluation is seeking to address, the level of variation in that outcome, and the expected magnitude of the impact on that outcome.

In most welfare reform evaluations, four key outcomes are the focus of the impact analysis: (1) the proportion of cases on cash assistance, (2) the mean benefit per case, (3) the proportion of cases with someone working, and (4) the mean earnings per case. Of these four outcomes, those that policymakers consider particularly important should be the focus of the sample design. If all four are of roughly equal importance (as often happens), the most conservative strategy is to focus on the outcome for which the relevant impact is likely to be hardest to detect (that is, the outcome that requires the largest sample to detect a statistically significant impact). The two factors that determine the ease of detecting an impact for a particular outcome are (1) the variance of the outcome (which affects the variance of the impact estimate), and (2) the likely magnitude of the impact.

Among the four outcomes, earnings is likely to have the largest variance relative to the mean, and thus to require the largest sample size to detect an impact of a certain proportion; therefore, in many cases, samples are most conservatively designed to detect impacts on mean earnings. The likely magnitude of the impact also is important, however. In many past employment-training demonstrations, the proportionate impact on AFDC benefits tended to be smaller than the proportionate impact on earnings (Gueron and Pauly 1991). If a key goal is to be able to detect even a small impact on cash assistance benefit levels, that outcome may be the appropriate focus of the sample design.
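
To make this concrete, the following sketch computes approximate per-group sample sizes for each of the four key outcomes under one set of assumptions. All of the standard deviations and assumed impacts are hypothetical illustrations (not figures from the five evaluations), and the calculation is the standard normal-approximation formula for comparing two group means with a one-sided .05 test and 80 percent power.

    from statistics import NormalDist

    def n_per_group(sd, impact, alpha=0.05, power=0.80, one_sided=True):
        # Approximate per-group sample size to detect `impact` for an outcome
        # with standard deviation `sd` (normal approximation, equal group sizes).
        z = NormalDist().inv_cdf
        z_alpha = z(1 - alpha) if one_sided else z(1 - alpha / 2)
        z_power = z(power)
        return 2 * ((z_alpha + z_power) * sd / impact) ** 2

    # Hypothetical standard deviations and assumed impacts, in outcome units.
    outcomes = {
        "proportion on cash assistance": (0.50, 0.03),   # impact of 3 percentage points
        "mean AFDC benefit per case ($)": (250, 15),
        "proportion with someone working": (0.45, 0.03),
        "mean earnings per case ($)": (400, 20),
    }

    for name, (sd, impact) in outcomes.items():
        print(f"{name}: about {n_per_group(sd, impact):,.0f} cases per group")

Under these illustrative assumptions, earnings requires the largest sample, which is why it often becomes the binding outcome for the design.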

A sample well designed for assessing impacts on these key outcomes may be weak for assessing other types of impacts. For example, the terms and conditions have required many states to assess the impacts of welfare reform on Medicaid paid claims. Because Medicaid paid claims vary extensively in the population (as some individuals have very high medical costs, but most have low costs), even large average experimental-control differences in Medicaid claims may not be statistically significant, with a sample designed primarily to estimate impacts on earnings.

Regression adjustment of impact estimates for baseline characteristics reduces the standard error of the impact estimates slightly (and thus, in principle, the sample size needed to detect a certain difference). Of the random-assignment evaluations reviewed here, only the Minnesota MFIP evaluation took into account the role of regression adjustment in determining desired sample size.

b. Precision Standard

The needed sample size also depends on the level of precision at which the impact is to be measured. The precision standard for a sample design is determined by three factors: (1) the desired level of statistical significance for the impact estimate, (2) the power of the sample design (the probability of detecting the desired effect), and (3) whether a one-sided or a two-sided hypothesis test is used. A result is referred to as statistically significant if, assuming the true impact were zero, the probability of obtaining an estimate as large as the one observed (given its standard error) is very low--generally 10 percent or less (typical standards are 10 percent, 5 percent, or 1 percent). For a given size impact, the smaller the standard error, the more statistically significant the estimate; larger sample sizes are thus required to detect an effect at the 1 percent level of significance than at the 5 percent level. The power of the design is the probability of detecting an effect, assuming an effect of a given size is present--for example, if the design has 80 percent power to detect a 5 percentage point impact at a 5 percent significance level, then, assuming the true impact of the program is 5 percentage points, the probability that a statistically significant impact will be observed is 80 percent. The larger the sample size, the higher the power of the sample to detect impacts of a given size and significance level.

Most evaluation research uses two-sided hypothesis tests, under the assumption that it is useful to distinguish effects in the desired or the unintended direction from policies with no effect. Bloom (1995) argued that one-sided tests may be adequate for most evaluations, since the key concern is to distinguish whether a policy had the desired effect or not. The advantage of one-sided tests is that smaller sample sizes are needed than in two-sided tests to achieve a given level of power and statistical significance.
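
The sample size implication can be quantified directly: required sample sizes scale with the square of the sum of the critical value and the power term, so switching between one- and two-sided tests at the same significance level and power changes the required sample by a fixed proportion. A minimal sketch:

    from statistics import NormalDist

    z = NormalDist().inv_cdf
    alpha, power = 0.05, 0.80
    z_power = z(power)

    # Required n is proportional to (z_alpha + z_power)^2, so the ratio of
    # two-sided to one-sided sample sizes at the same alpha and power is:
    ratio = ((z(1 - alpha / 2) + z_power) / (z(1 - alpha) + z_power)) ** 2
    print(f"A two-sided test requires about {100 * (ratio - 1):.0f} percent more "
          f"sample than a one-sided test at alpha = {alpha}, power = {power}.")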

c. Sample Balance

Dividing the sample into equal numbers of experimental (demonstration) and control (comparison) cases (this is referred to as a "balanced" design) leads to estimates with the highest level of precision, for a given total sample size.(1) However, substantial deviations from this balance may occur with only minor losses in precision (Bloom 1995). States may prefer an unbalanced sample because of a desire to implement the reform program as completely as possible (if the reforms are implemented statewide for all cases except control cases). By having the minimum allowed number of control cases but more experimental cases, states can increase sample precision while keeping the control group as small as possible. Thus, in many evaluations in which the intervention is implemented for everyone except the control group, the sample is designed to include two experimental group members for every control group member. Increasing the ratio of experimentals to controls beyond 2:1, for a fixed total sample size, leads to more substantial loss in precision. Increasing the total sample size by adding additional experimental group members (but keeping the control group sample the same) increases precision only slightly.
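
To illustrate how mild the precision loss is for moderate imbalance, the sketch below simply evaluates the standard result (see note 1 at the end of this chapter) that the variance of the impact estimate is proportional to 1/[T(1-T)] for a fixed total sample, where T is the experimental share; the shares shown are illustrative.

    import math

    def se_relative_to_balanced(t):
        # Standard error of the impact estimate relative to a 50/50 split,
        # holding total sample size fixed; variance is proportional to 1/[t(1-t)].
        return math.sqrt(0.25 / (t * (1 - t)))

    for t in (0.50, 0.60, 0.67, 0.75, 0.80):
        print(f"experimental share {t:.2f}: standard error is "
              f"{se_relative_to_balanced(t):.2f} times the balanced value")

A 2:1 allocation (an experimental share of about .67) inflates the standard error by only about 6 percent, while a 4:1 allocation inflates it by about 25 percent.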

d. Trade-Offs Between Subgroup Analysis and Full-Sample Analysis

Oversampling of key subgroups allows the evaluation to obtain more precise estimates of program impacts for the subgroups of interest. However, such oversampling (if total sample size is held constant) also reduces the precision of the estimates of impacts on the full sample. This becomes less of a concern if there are enough resources to have larger than minimum sample sizes overall, since the increase in precision from having a larger sample will at least partly balance the loss in precision from stratification.

For example, suppose subgroups are defined as the individual demonstration sites. Samples may be allocated across the sites in three ways:

  1. No Stratification. If the population about which inferences are to be made is the caseload in the research sites only, sampling rates should be the same in all the sites, and the sample sizes in the sites should be proportional to the number of cases in those sites.
  2. Stratification to Increase Precision of Site-Level Impact Estimates. To make inferences about impacts in specific sites as well as the entire group of research sites, sample sizes should be set to balance the precision needs of the two types of estimates. In general, cases in the smaller sites will be oversampled in relation to cases in the larger sites. It still may be desirable to have larger samples in larger sites, however, to increase precision of the overall estimates, as long as the samples in the smaller sites meet a minimum standard for site-level precision.
  3. Stratification to Increase State-Level Representativeness. If the population about which inferences is to be made is the entire state caseload, the sampling process is appropriately conceived of as a two-stage sampling process, in which sites are selected first, then cases within sites. Such a design could, in principle, lead to oversampling of either large or small sites. In this setting, implications for precision are most appropriately evaluated in the context of the state as a whole.

These same three approaches can be applied to determining sample sizes for other subgroups.
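
The precision cost of oversampling for full-sample estimates can be approximated with Kish's design effect for unequal weights, deff = n(sum of squared weights)/(sum of weights) squared. The sketch below applies it to a hypothetical two-site allocation; the caseload sizes and sampling rates are invented for illustration.

    def kish_design_effect(weights):
        # Approximate variance inflation for full-sample estimates caused by
        # unequal sampling weights (Kish's formula).
        n = len(weights)
        return n * sum(w * w for w in weights) / sum(weights) ** 2

    # Hypothetical allocation: a large site sampled at 10 percent (weight 10)
    # and a small site sampled at 100 percent (weight 1).
    weights = [10.0] * 1000 + [1.0] * 500
    deff = kish_design_effect(weights)
    print(f"design effect ~ {deff:.2f}; "
          f"effective full-sample size ~ {len(weights) / deff:,.0f} of {len(weights):,d} cases")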

e. Nonexperimental Versus Experimental Design Requirements

In general, nonexperimental designs require larger samples than experimental designs for a given outcome measure. For example, suppose a design compares applicants to a welfare reform program in some counties with applicants to the current program in other counties. Suppose also that differences (other than the welfare reform program) between the demonstration and comparison groups could be completely controlled for using measured background characteristics. Even in this case, for a given sample size, the standard error of the regression-adjusted impact estimate would be larger than in an experimental evaluation because of correlations between the welfare reform site indicator and the background variables in the equation. Intuitively, the greater the extent to which variables are correlated (tend to move together), the larger the sample required to "sift out" their separate effects--in this case, to separate the impact of the program from the effects of other characteristics.(2)

The difficulty of sorting out program impacts from other factors is magnified if there are unobserved differences between the demonstration and comparison groups ("selection bias"). In the best of circumstances, these differences may be adjusted for using two-equation models.(3) In many such models, the first equation predicts membership in the treatment (demonstration) group (as a function of individual or site characteristics). The second equation estimates the effects of the program using predicted treatment status from the first equation rather than actual treatment status. Such models typically produce very imprecise impact estimates and therefore require much larger sample sizes to detect impacts of a given magnitude (Burghardt et al. 1985).
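
A small simulation conveys the precision problem. The sketch below is a stylized version of the two-equation approach described above, with an invented data-generating process (the variable names, coefficients, and sample sizes are hypothetical and not drawn from any of the state evaluations); it compares the spread of a naive single-equation estimate and the two-equation estimate across replications.

    import numpy as np

    rng = np.random.default_rng(0)
    true_impact, n, reps = 1.0, 2000, 200
    naive_estimates, two_stage_estimates = [], []

    for _ in range(reps):
        z = rng.binomial(1, 0.5, n).astype(float)   # "identifying" variable
        x = rng.normal(size=n)                      # observed background characteristic
        u = rng.normal(size=n)                      # unobserved factor driving selection
        d = (0.8 * z + 0.5 * x + u + rng.normal(size=n) > 0).astype(float)  # treatment status
        y = true_impact * d + 0.5 * x + u + rng.normal(size=n)              # outcome

        # Naive single-equation estimate: biased because u affects both d and y.
        X = np.column_stack([np.ones(n), d, x])
        naive_estimates.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

        # Two-equation estimate: equation 1 predicts treatment from z and x;
        # equation 2 uses the predicted treatment in place of actual treatment.
        Z = np.column_stack([np.ones(n), z, x])
        d_hat = Z @ np.linalg.lstsq(Z, d, rcond=None)[0]
        X2 = np.column_stack([np.ones(n), d_hat, x])
        two_stage_estimates.append(np.linalg.lstsq(X2, y, rcond=None)[0][1])

    print(f"true impact {true_impact:.2f}")
    print(f"naive estimate:     mean {np.mean(naive_estimates):.2f}, "
          f"std across replications {np.std(naive_estimates):.2f}")
    print(f"two-stage estimate: mean {np.mean(two_stage_estimates):.2f}, "
          f"std across replications {np.std(two_stage_estimates):.2f}")

In runs of this kind, the naive estimate is precise but biased, while the two-equation estimate is roughly centered on the true impact but much more variable--which is the sense in which such models require larger samples.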

In a nonexperimental evaluation, however, it may be possible to limit the population of interest to those most likely to be affected by the reforms, so that the impact to be detected is easier to measure. For instance, many of the current waiver evaluations include provisions that affect program eligibility at initial application. States are thus required to randomly assign all AFDC applicants to an experimental or control group. A concern is that the applicant sample includes many applicants who would be denied AFDC benefits under both the new and old versions of the program (and who thus "dilute" estimates of program impacts). A nonexperimental design that compared only approved applicants under the old and new programs would be examining populations with much higher levels of AFDC participation. Thus, assuming the differences between the two groups could be adequately controlled--a big assumption--it would need smaller samples to detect given percentage impacts on participation.

2. State Approaches

Staff members at DHHS typically have specified sample sizes for welfare reform evaluations in the waiver terms and conditions after detailed discussions with the state. These sample size requirements vary according to the state's evaluation objectives and the size of the population being studied. The usual minimum requirements, however, are for the control group to include 2,000 recipient cases and 2,000 approved applicant cases and for the experimental group to be one to two times as large. States may exceed these minimum requirements to improve the precision of their estimates. Usually, sample size requirements do not include specific sample size goals for subgroups. States are required to sample all applicants (not just approved applicants) if the intervention affects eligibility for AFDC, but the sample size requirement is still generally phrased in terms of approved applicants. Thus, the federal requirements generally imply larger sample size requirements when the intervention affects eligibility.(4) Despite this federal guidance, the five evaluations reviewed for this study varied greatly in their planned sample sizes (overall and for key subgroups), as well as in the goals, assumptions, and precision standards used to justify these sample sizes.

a. Planned Sample Sizes

Table III.1 (Planned Sample Sizes in Five State Waiver Evaluations) summarizes planned sample sizes in the five evaluations. We first review these planned samples and the data, assumptions, and precision standards used to justify them; later, we discuss how well actual sampling experience has accorded with plans.

Wisconsin. Wisconsin's WNW (the only nonexperimental evaluation) had a planned sample size of 4,000 cases in the demonstration counties and at least 4,000 in the comparison counties (for the part of the evaluation based on a comparison county design). The sample of 4,000 in the demonstration counties was expected to consist of 1,000 recipient cases (the full caseload in those counties) and 3,000 applicants (all applicants over a seven-year period).

The evaluation plan prepared by MAXIMUS discusses the adequacy of the sample size in the WNW evaluation in terms of Cohen's "effect size" measure, defined as the impact on an outcome divided by the standard deviation of the outcome (Cohen 1977).(5) A table shows the sample size needed to detect various effect sizes for one-sided tests with a .05 significance level, at levels of power ranging from 50 to 99 percent. The text notes that a sample of 4,000 each in the demonstration and comparison groups is more than adequate to detect the smallest shown effect size (.10 or 10 percent of the standard deviation of the outcome) at the highest level of power.(6) It is difficult to assess, however, whether an effect size of .10 is realistic for the outcomes being considered without more information. Furthermore, the evaluation plan does not discuss whether the sample is sufficient for the applicant and recipient samples considered separately. The evaluation plan also mentions possibly increasing the size of the comparison group sample to the full caseload in the comparison counties for outcomes easily measured in administrative data, as one way to add precision to the estimates. The effects of the nonexperimental nature of the evaluation on sample precision are not considered.
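
For reference, the correspondence between effect sizes and required sample sizes can be reproduced with the same normal-approximation formula used earlier; the sketch below is a generic calculation (not a reproduction of the MAXIMUS exhibit) tabulating per-group sample sizes for one-sided .05 tests at two levels of power.

    from statistics import NormalDist

    z = NormalDist().inv_cdf

    def n_per_group(effect_size, power, alpha=0.05):
        # Per-group n to detect an impact equal to `effect_size` standard
        # deviations of the outcome (one-sided test, normal approximation).
        return 2 * ((z(1 - alpha) + z(power)) / effect_size) ** 2

    for d in (0.10, 0.15, 0.20, 0.25):
        print(f"effect size {d:.2f}: ~{n_per_group(d, 0.80):,.0f} per group at 80% power, "
              f"~{n_per_group(d, 0.99):,.0f} at 99% power")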

California. In California, the required sample size was 15,000 recipient cases (5,000 controls and 10,000 experimental group members). The required sample size for the approved applicant sample was specified as the sample over four years assuming that applicants are sampled using the same sampling rates as used for the recipient cases. The estimated sample of applicants outlined in the sampling plan was 17,280, consisting of 11,520 experimental cases and 5,760 controls.(7) Although we have not found any explicit analysis of precision in the California materials, the large overall sample appears to have been intended to permit subgroup analyses (see Section A.2.b).

Among the five state evaluations, only California planned on unbalanced sample sizes for the two research groups, with two experimental group members for every control group member. Because the demonstration counties had caseloads much larger than twice the control group sample, including additional experimental cases was more feasible than it would have been with smaller sites.(8) The larger experimental group improves the precision of the estimates.

Colorado. In Colorado, the terms and conditions require the following samples: (1) recipients--2,000 experimental and 2,000 control cases, and (2) approved applicants--2,000 experimental and 2,000 control cases. The planned sample sizes described in the evaluation plan are: (1) recipients--2,034 experimental and 2,034 control cases, and (2) approved applicants--3,288 experimental and 3,288 control cases. The planned applicant sample was larger than required because the Colorado staff interpreted the sample size requirements in the terms and conditions as referring to the number of cases active two years after implementation. The Colorado sampling plan analyzes precision in terms of the minimum sample sizes needed for county-level estimates but assumes applicant and recipient cases will be pooled for analysis. It does not make clear the need for county-level precision or the rationale for pooling applicant and recipient cases (pooling is discussed further in Section B). The stated precision standard for the analysis is 95 percent power for a one-tailed test; this precision standard is applied to an assumed reduction in recidivism to welfare from 30 to 15 percent.(9) The power requirement of 95 percent is higher than that typically used in evaluation research (80 percent is more common). In addition, recidivism to welfare is not really an appropriate outcome measure on which to base the power analysis, since it is an outcome that can only be defined for a nonrandom portion of the sample (cases that have already exited AFDC).(10)

Michigan. In the Michigan TSMF evaluation, the planned sample size was 21,952--13,578 recipients and 8,374 applicant cases, evenly divided between experimental group members and control group members. The Abt proposal shows that this total sample is adequate to detect a 5 percent impact on earnings under the following assumptions: mean monthly earnings of $165 for controls, with a standard deviation of $244 (based on "a recent study of welfare recipients") and a precision standard for a one-tailed test of a 5 percent significance level and 80 percent power. This calculation assumes no increases in variance due to stratification and ignores any reductions from regression adjustment of impact estimates. Again, the assumption seems to have been that applicants and recipients would be pooled.

Minnesota. The MFIP demonstration has four experimental groups and multiple strata; this substantially complicates the relevant power calculations (see Tables III.1 and III.2). Table III.2 (Planned Sample Sizes for the Minnesota MFIP Evaluation, by Subgroup) presents the full design for the Minnesota sample. Probably because of the complex design of the demonstration, the terms and conditions of the MFIP evaluation have an explicit precision standard, unlike those in the other evaluations that we have reviewed. The terms and conditions state that samples must be adequate to detect experimental-control differences in major outcomes equal to 20 percent of the standard deviation of the outcome at a 5 percent significance level with 80 percent power.

The MFIP evaluation design report argues that the proposed MFIP sample design can meet this standard in comparisons of any two experimental groups with 2,000 cases each, using the employment rate as the key outcome, a two-tailed test, an assumed mean of 50 percent employed in the control group (which the authors say is consistent with other MDRC studies), and assumed gains from regression adjustment of the impact estimate equivalent to a regression equation with an R-squared equal to .08 (which they also say is consistent with experience).(11) This calculation assumes pooling of applicant and recipient cases, and no increases in variance (often referred to as "design effects") due to stratification of the sample.(12) The two smaller research groups (E2 and C2) are roughly 2,000 cases each, but E2 is stratified by county. The larger groups (E1 and C1) are well above that level, but they were stratified by urban/rural location and (within these groups) into several other subgroups, with different sampling rates for the different subgroups (see the next subsection). The larger samples in groups E1 and C1 (over 6,000 in each) may balance or outweigh any design effects from stratification.
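
A rough check of that argument can be made with the usual minimum-detectable-effect approximation, including a (1 - R-squared) term for the regression adjustment. The sketch below is a back-of-the-envelope calculation using the design parameters quoted above, not a replication of MDRC's own analysis.

    from math import sqrt
    from statistics import NormalDist

    z = NormalDist().inv_cdf

    # Design parameters quoted above: two groups of 2,000 cases, a two-tailed
    # 5 percent test, 80 percent power, an employment rate of about 50 percent
    # in the control group, and regression adjustment with R-squared = .08.
    n1 = n2 = 2000
    multiplier = z(1 - 0.05 / 2) + z(0.80)
    sd = sqrt(0.5 * (1 - 0.5))            # standard deviation of a 50 percent employment rate
    mde = multiplier * sqrt((1 - 0.08) * (1 / n1 + 1 / n2)) * sd

    print(f"minimum detectable impact ~ {100 * mde:.1f} percentage points, "
          f"or about {mde / sd:.2f} standard deviations of the outcome")

Under these assumptions, the minimum detectable impact is well below the 20-percent-of-a-standard-deviation standard, consistent with the design report's conclusion.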

b. Subgroup Sample Sizes

Other than stratification of the sample between applicants and recipients (discussed in Section B), the only explicit stratifications of the sample in the five studies examined were by site (or grouping of sites, such as urban versus rural) and by single-parent versus two-parent cases. The motivation behind these stratifications generally was to allow more precise estimates for subgroups; the implications for precision of the estimates for subgroups and overall were not explicitly drawn out.

All of the evaluations (except Wisconsin, which is not really comparable because of its quasi-experimental design) to some extent oversampled cases in smaller sites. In three instances, the motivation was to increase the precision for subgroup estimates; in one instance, it was to increase statewide representativeness:(13)

  • In California, the sample was allocated across counties as follows: 40 percent to Los Angeles (roughly proportional to its relative caseload), and 20 percent each to the remaining three counties of Alameda, San Bernardino, and San Joaquin. This allocation substantially oversamples San Joaquin in particular. The goal of being able to measure site-specific impacts justified this approach.
  • Colorado sampled at a higher rate in smaller sites and sampled the full caseload in the smallest county included in the study. The goal of a minimum of 330 experimental and 330 control group cases in each county determined the sampling rates, with additional cases from the largest counties selected to meet the overall sample size goals (and to improve overall precision).
  • In Michigan, the sample was selected from four offices: two in Wayne County (Detroit) and two in other parts of the state. The entire caseload in the two non-Wayne offices was assigned to the research sample, but only 70 percent of the caseload in the two Wayne County offices was assigned to it. The motivation for this allocation appears to have been to make the sample more representative of the state as a whole, since the proportion of the sample from Wayne County thus resembled the proportion of the state caseload from Wayne County.
  • Minnesota had an explicit stratification into urban versus rural sites: the full caseload was sampled in the rural sites, but not in the urban sites. The motivation for this allocation was to derive separate estimates for urban versus rural areas.

Two of the evaluations reviewed stratified explicitly by single-parent versus two-parent cases. California set up the sample so that one-third of cases sampled were two-parent cases (AFDC-UP cases), although such cases typically make up less than 15 percent of the caseload. Minnesota also explicitly oversampled two-parent cases (including cases on the state general assistance program and AFDC-UP), relative to their basic sample of single-parent cases in urban areas.(14) Again, no explicit power analyses were offered to justify these sample sizes, but the motivation was clearly to increase the precision of estimates for two-parent cases. This stratification seems sensible, since changes in rules for two-parent families were a major part of the reform packages in these states, and both states had relatively large sample sizes.

None of the evaluators appears to have considered the effects of oversampling of sites or other subgroups on the precision of the estimates for the overall research sample.

c. Planned Versus Actual Samples

The discussion so far has been of planned sample sizes in the five state waiver evaluations reviewed. At this time, it is apparent that actual samples in several of the states are not as large as planned.(15) This problem is discussed further in the next section; here, we note only that not meeting sample goals can seriously reduce the usefulness of an evaluation.

3. Analysis and Recommendations

Our recommendation is that states, in developing their preliminary evaluation sample designs, specify the precision standard that estimates of the key outcomes must meet (rather than minimum sample sizes) and the key outcome measures to which these standards must be applied. In addition, designs should justify the magnitude of the impact they expect to measure and the assumed variance of the outcome measure, which inevitably vary with the nature of the intervention. A reasonable precision standard would be the ability to detect a plausible impact on all applicants or all recipients at a 5 percent significance level with 80 percent power, using a one-tailed test. We do not generally recommend pooling the applicant and recipient samples; the reasons are discussed in the next section. In addition, we recommend allowing reductions in sample size due to the increased precision from regression adjustment, particularly if plans for collection of baseline data (discussed further in Chapter V) are also included in the design.(16)

The study's research questions should determine the key outcome on which the sample size is to be based. It may be appropriate, however, for DHHS to recommend "default" assumptions (based on a review of the existing literature) concerning magnitude of impacts, standard errors, and regression reductions in standard errors for the most common outcome variables. States then could elect to use these outcome measures and associated assumptions, or they could propose others, but they should state and justify their assumptions.

The power of the sample design to detect impacts should be addressed for key outcomes, for the full sample and for key subgroups (particularly for subgroups for which there is an explicit stratification). States also may wish to establish an explicit precision standard for subgroups. States should consider design effects resulting from any oversampling of subgroups; federal officials could also suggest default assumptions concerning the likely magnitude of such effects from a review of previous studies.

B. SAMPLING OF APPLICANTS AND RECIPIENTS

All welfare reform evaluations have been required to sample both from the existing caseload (recipients) and from the flow of applicants. If the welfare reform program affects eligibility, the sample of applicants must include those whose applications are denied or withdrawn, but otherwise the sample may be limited to approved applicants. This section considers the issues involved in allocating the sample between applicants and recipients and in determining the length of time over which applicant sampling is to take place.

1. Issues

In making decisions about the relative sample sizes for applicants and recipients and the timing of applicant sampling, states must consider (1) how the data for the two sample groups will be used, (2) the trade-off between the risks inherent in a long intake period and the need to assess how a program matures, and (3) the ability of the state to forecast the flow of applicants.

a. Role of Applicant and Recipient Samples

Often, DHHS recommends that applicants be sampled using the same sampling intervals used in selecting recipient cases, to avoid the need to calculate weights. If the same sampling rate is used for both groups, active cases in the sample at each point in time are representative of all active cases, without weighting. A major motivation for this approach has been to allow states to meet the federal cost neutrality requirements more easily. For cost neutrality, the active cases in the research sample should be representative of the full active caseload in the research counties at each point in time, so that the impact of the program on AFDC program costs can be assessed.(17) (Although cost neutrality is not the subject of this report, it is important to note here how these requirements shape the sample designs that are also used for impact analyses.) If applicants are sampled at a different rate than recipients, it is still possible to achieve representativeness for cost neutrality purposes by appropriate weighting. As long as the sampling rate for applicants remains the same over time, construction of such weights is straightforward.(18) Whether or not weights are used, as soon as sampling of applicants ceases, the sample is no longer representative of the full active caseload. The active cases that remain in the sample increasingly will underrepresent newer cases.
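
As a minimal sketch of this weighting logic (the sampling rates and case counts below are hypothetical), each sampled case simply receives a weight equal to the inverse of its group's sampling rate, and weighted totals then represent the full active caseload:

    # Hypothetical sampling rates for the two sample groups.
    sampling_rates = {"recipients": 0.13, "applicants": 0.80}
    weights = {group: 1.0 / rate for group, rate in sampling_rates.items()}

    # Hypothetical counts of sampled cases still active at some point in time.
    sampled_active = {"recipients": 900, "applicants": 640}

    # The weighted count estimates the full active caseload represented by the sample.
    estimated_active_caseload = sum(sampled_active[g] * weights[g] for g in sampled_active)
    print("weights:", {g: round(w, 2) for g, w in weights.items()})
    print(f"estimated active caseload ~ {estimated_active_caseload:,.0f}")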

In the impact analysis, the applicant and recipient samples are each of interest in their own right; in most cases, impact studies analyze data on the two groups separately. Applicant cases experience the new program from the beginning of their spell, and thus are more indicative of long-term effects; they also are representative of the full range of cases that apply for AFDC. Recipient cases give evidence of the short-term effects of welfare reform on the existing caseload.

Because the recipient group represents the stock of cases at a point in time, it is made up of long-term recipients to a much larger extent than the flow of applicant cases. The recipient sample may thus indicate the effects of reform on the most disadvantaged portion of the caseload, particularly if analysis is restricted to long-term recipients.

b. Trade-Offs Between Long and Short Sampling Periods for Applicants

In designing applicant samples, states should consider the following trade-offs between long and short sampling periods:

  • If the sample of active cases at all points in time is to be representative of the current caseload, it is necessary to continue sampling applicant cases throughout the demonstration. Sometimes, sampling over long periods also is necessary to meet sample size goals. In addition, sampling over a longer period may permit assessment of the impact of a more mature program, since implementation problems may mark the early period.
  • If the planned sampling of applicant cases is stretched out over more than two years, there are substantial risks for the impact analysis: (1) the length of followup on applicant cases will be shorter, on average, than if sampling was more front-loaded, and may be insufficient to observe full impacts; (2) the intervention may change, resulting in different programs being tested at different times; (3) sampling may be cut off early (if the evaluation is cut back for financial or substantive reasons) and then the applicant case sample may be too small to be useful; and (4) the application rate may fall over time, so that the sample ends up smaller than intended.

c. Forecasting Applicant Flows

Another consideration in the design of applicant samples is that states must accurately forecast the influx of applicants eligible for random assignment to achieve sample size goals in the expected time. Overestimating applicant flows puts states at risk of either having an inadequate sample or needing to extend sampling. Factors to be considered in estimating applicant flows are (1) the proportion of applicants who will have already been through random assignment as part of a previous welfare spell, (2) the proportion of applicants who are transfers from another site (and thus, in most demonstrations, ineligible for sampling), and (3) whether the sample frame includes the full sample relevant for random assignment or excludes some relevant cases. In addition, the ability to forecast general caseload trends is always imperfect; for example, the strong national economy has contributed to greatly reduced AFDC caseloads in many states and thus made it harder to meet sample goals in some evaluations. The intervention itself may reduce applications; this is a substantial problem for the analysis (as discussed in Chapter VI), but it also affects sample sizes. Many of the evaluations reviewed did not take these issues into account and therefore have had difficulty meeting applicant sample goals.
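
The planning arithmetic itself is simple once the exclusions are made explicit; the sketch below uses made-up figures to show how quickly they compound (all parameters are hypothetical):

    # Hypothetical planning figures for applicant sampling.
    monthly_applications = 600
    share_prior_random_assignment = 0.15   # applied before and already randomly assigned
    share_transfers = 0.05                 # transfers from other sites (excluded)
    approval_rate = 0.65                   # relevant if only approved applicants count
    sampling_rate = 1.0                    # share of eligible applicants sampled
    target_sample = 4000

    eligible_per_month = (monthly_applications
                          * (1 - share_prior_random_assignment)
                          * (1 - share_transfers)
                          * approval_rate
                          * sampling_rate)
    months_needed = target_sample / eligible_per_month
    print(f"~{eligible_per_month:.0f} approved, eligible applicants sampled per month; "
          f"target of {target_sample:,d} reached in ~{months_needed:.0f} months")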

2. State Approaches

Table III.1 shows the planned applicant and recipient sample sizes in four of the five evaluations reviewed (the Minnesota evaluation plan did not provide estimated sample sizes for the two groups). The decisions each state made concerning whether to oversample applicants and the time period for sampling applicants are summarized as follows:

  • Minnesota is oversampling new applicants in its two core experimental groups to better assess impacts for this group. Wisconsin is also oversampling applicants in the demonstration counties because that is the only way it can meet overall sample size goals in such small counties. California, Colorado, and Michigan all are attempting to sample applicants and recipients at the same sampling rates. Nonetheless, because of differences across the states in projected flows and applicant sampling periods, applicants outnumber recipients in Colorado's design, but recipients outnumber applicants in Michigan's design. California originally expected the applicant sample to be slightly larger than the recipient sample but (for reasons discussed later) actually has a much smaller applicant than recipient sample.
  • Wisconsin planned to sample applicants over seven years, California over four years, Colorado over two years, and Minnesota over 18 months in urban counties and 24 months in rural counties. Michigan initially planned to sample applicants over two years; the sampling period was extended by two years, however, as new waivers were passed and applicant samples were not as large as expected after the first two years.
  • The states furthest along in sampling--California and Michigan--found that the flow of applicants was smaller than expected; in Michigan, this may reflect the move from sampling all applicants to sampling approved applicants. Lower-than-expected applicant flows also seem likely to occur in Wisconsin.

The rest of this section discusses these findings in more detail, by state.

Wisconsin. Because of the small size of the demonstration counties in Wisconsin, the state planned an applicant sample that would be three times the recipient sample. It planned to achieve this sample by sampling for seven years. This lengthy intake period seems unrealistic, given the current rapid pace of policy development; indeed, Wisconsin now is proposing a new program (Wisconsin Works) that would supersede WNW. A key problem with the Wisconsin design is that the two demonstration sites are too small to meet sample goals within a policy-relevant time frame.

As of April 1996, the evaluator for WNW had received from the state a list of cases enrolled in the demonstration in the first nine months, but no data on applicants who did not complete their applications or on cases in the comparison counties. Thus, it is difficult to assess how well the demonstration has been achieving the planned sample goals. Anecdotal evidence, however, suggests that the demonstration is having substantial entry effects, which may reduce the sample sizes well beyond those expected. One indication is that the caseloads in the two counties declined nearly 20 percent between the time the demonstration was announced and actual implementation, leading to a recipient sample of 818 cases rather than 1,000.(19)

California. California's APDP sample was designed with the goal of sampling at the same rate from the recipient pool and the applicant flow, to continue to have a representative sample of the caseload in the demonstration counties over time for cost neutrality. To support the cost neutrality calculations, California originally planned to continue sampling for four of the five years of the follow-up period. A second set of waivers, implemented 16 months after the first set, extended the duration of the demonstration; at this time, however, state officials do not plan to extend the sampling period.

The original sampling plan estimated that the applicant sample would be slightly larger than the recipient sample (see Table III.1). The actual flow of applicants into the sample has been roughly one-third of what was expected, however, resulting in an applicant sample much smaller than the recipient sample. As of April 1996, 40 months after implementation, only 5,460 applicants had been sampled (1,824 controls and 3,636 experimental cases). Three reasons for the discrepancy are:

  1. Applications have been declining because of the economic recovery and, possibly, because of the multiple cuts in the maximum benefit. Because California is a "fill-the-gap" state, which allows applicants to fill the gap between the AFDC payment and the need standard with other income without losing eligibility, the cuts in the maximum benefit would not make anyone ineligible, but they could have behavioral effects if participation in AFDC becomes less attractive.
  2. The state uses a sample frame to select the sample that does not include all approved applicants. In particular, the sample is selected from the statewide Medicaid system. Very short-term cases (which may never be entered into the Medicaid system) and cases that are entered into the Medicaid system late are not included in the sample frame. The state believes this problem reduces the size of the sample but does not bias its composition (except that very short-term recipients are excluded).
  3. The state did not accurately estimate the number of cases that would be excluded from sampling because they had received AFDC within the recent past.

Colorado. Sample intake in Colorado has proceeded largely according to plan. The state recently stopped intake one month early with an applicant sample of about 6,000 cases, about 600 fewer than planned, stating that it had more than met the requirement in the terms and conditions for 4,000 applicant cases.

Michigan. In Michigan, the projected applicant sample after two years was 8,374, but the actual sample intake was about 6,600. The shortfall seems to reflect the omission of denied applicants (those denied for both AFDC and SFA) from the sample. Denied applicants were dropped because Michigan's data system does not retain data on them. In addition, the state has argued that the same cases would be denied all benefits under both TSMF and control group rules; the intervention merely affects whether cases are approved for AFDC or SFA.

The intake period for new applicants was extended for two years largely because of the implementation of new waivers that substantially changed the TSMF program. As a result, the applicant sample may approach the recipient sample in size. Most analyses will examine applicants in the first two years and the second two years separately, however.

Minnesota. In the design for the MFIP evaluation, sample goals are not broken down into goals for applicants and recipients in the usual way. Instead, the sample design discusses a "basic" single-parent sample, which is a proportional sample of recipients coming up for redetermination, reapplicants (defined as those on AFDC in the past three years), and new applicants (defined as those not on AFDC in the past three years) (see Table III.2). In addition, there is a plan to oversample 1,800 single-parent new applicants in urban counties, equally divided between the two larger experimental groups (E1 and C1). Again, new applicants are defined as those not on AFDC in the past three years. The report states that this implies a sampling rate in the urban counties of 13 percent for single-parent recipients, but 80 to 86 percent for single-parent applicants. In the rural counties that are part of the MFIP demonstration, all applicants and recipients are subject to random assignment to one of the two core experimental groups.

MDRC staff members report that the intake for reapplicants and new applicants has exceeded expectations; this led to shortening of the intake period for reapplicants and to assigning a greater proportion of single-parent new applicants and two-parent applicants to nonexperimental groups.

3. Analysis and Recommendations

From the perspective of the impact analysis, there is less interest in pooling the applicant and recipient samples than in analyzing each (and particularly applicants) separately. A pooled sample can be selected to give unbiased estimates of impacts on the caseload over the demonstration period. Such impacts, however, are made up of effects on those already on AFDC before welfare reform and of effects on those entering the system only after welfare reform; in general, it is the latter effect (that on applicants) that is of long-term interest. Therefore, we recommend that sample sizes be sufficient to analyze recipients and applicants separately; this typically implies sampling applicants at a higher rate than recipients.

Because an extended sampling period brings large risks along with large administrative costs, we recommend designing the applicant sampling process to reach the target sample over a two-year period, if possible. A shorter sampling period implies a longer follow-up period, less likelihood of major program changes, and some flexibility to extend sampling if goals are not being met.

Finally, applicant sampling rates need to be set carefully to take into account the exclusion of transfers and those who have been through random assignment before. States should use any available historical data on accessions to predict these rates. This is just one example of the usefulness of longitudinal data on accessions and terminations.

C. SELECTING SITES

The selection of sites is one aspect of sample design that is rarely addressed formally. Only one of the states reviewed selected sites for its welfare reform waiver evaluations even partially through a formal sampling process. None of the states analyzed the precision of its sample estimates as estimates of the state population as a whole and, thus, none attempted to take into account the increased variance due to clustering of its samples in selected sites rather than in the entire state. The federal government has not required states to do this, mainly because of the considerable political and administrative realities that limit site selection. In addition, the state may have other goals for the demonstration (such as assessing whether a program will work under the most favorable conditions). However, the lack of representativeness (also referred to as external validity) can limit the usefulness of the information from a demonstration. Here, we discuss different possible approaches for selecting sites.

1. Issues

Several approaches to selecting sites for an impact evaluation are possible:

  • Sampling of Sites. This approach involves random selection of sites to be representative of the state as a whole. To ensure inclusion of large and small sites, sites often are selected with probability proportional to (caseload) size (PPS), as in the sketch following this list. To ensure geographic diversity or diversity in other important characteristics, stratification may be used, with random or PPS sampling within strata. Formal sampling procedures ensure representative sites. However, when the intervention is a random-assignment demonstration that places considerable administrative burdens on the sites selected, random selection of sites may not be feasible.
  • Selecting Judgmentally Representative Sites. Instead of using a formal sampling process, sites may be selected judgmentally (from among the subset of sites in which implementation is feasible) so that they are approximately representative of the state as a whole. For instance, in selecting sites, states may work to ensure that all regions are represented, that both urban and rural areas are represented, or that key program variations or types of local economic conditions are included.
  • Selecting Exemplary Sites. If the major goal of an evaluation is to assess the feasibility of implementing a new program or to study operation of that program, a state may wish to choose those sites that are most interested in operating the new program or that are otherwise best suited to testing the program. Such an approach clearly limits the generalizability of the impact results, but that may be a worthwhile trade-off. In addition, impact results in such circumstances can be seen as program impacts in the "best case"; a finding of little or no impact is thus still informative.
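
A minimal sketch of the probability-proportional-to-size selection mentioned in the first approach above, using systematic PPS sampling over hypothetical county caseloads (all figures are invented for illustration):

    import random

    # Hypothetical county caseloads. Counties whose caseload exceeds the sampling
    # interval may be hit more than once; in practice such sites are taken with certainty.
    caseloads = {"A": 9000, "B": 4200, "C": 2500, "D": 1800, "E": 900, "F": 600}
    n_sites = 3

    total = sum(caseloads.values())
    interval = total / n_sites
    start = random.uniform(0, interval)
    selection_points = (start + i * interval for i in range(n_sites))

    selected, cumulative = [], 0
    point = next(selection_points)
    for county, size in caseloads.items():
        cumulative += size
        while point is not None and point <= cumulative:
            selected.append(county)
            point = next(selection_points, None)

    print("selected sites:", selected)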

Regardless of how sites are selected, it is always possible later to compare the characteristics of the sites to the state as a whole and to compare the characteristics of the AFDC caseload in the demonstration sites to the state caseload. If the demonstration sites appear reasonably similar to the rest of the state, this makes generalizing net impact results to the state level more plausible. If major differences exist, the evaluator can assess the possible implications of these differences. It may also be possible to reweight the demonstration sample (especially if it is relatively large) to be more representative of the state caseload.

2. State Approaches

The five evaluations range from those that were not concerned with external validity in selecting sites to those that have made a serious attempt to achieve a representative sample:

  1. Wisconsin made no claims of representativeness for the two demonstration counties for the WNW evaluation--Pierce and Fond du Lac; instead, it selected these sites because they were very interested in implementing the demonstration and seemed likely to achieve success (Bloom and Butler 1995). The two counties are small, relatively prosperous, and overwhelmingly white (as is most of the state outside Milwaukee). The state's main goal was to test the feasibility of the new approach. The selection of these sites severely limits the ability to assess impacts, however, even within these two sites. In particular, sample sizes are small, and MAXIMUS was not able to select comparison sites with unemployment rates quite as low as in the two demonstration sites.
  2. California has implemented the APDP/WPDP demonstration in four counties that are judgmentally representative of the state and that contain 49 percent of the state's caseload: Los Angeles, San Bernardino, Alameda (Oakland-Berkeley), and San Joaquin. The first process report states: "Planners chose Los Angeles because of its critical importance to the state, San Bernardino because it is adjacent to Los Angeles, Alameda for its ethnic diversity, and San Joaquin to represent the Central Valley and because of its proximity to Alameda" (UC DATA 1994). Los Angeles and Alameda have large urban areas, while the other two are largely rural. They range in population: Los Angeles has a population of 9 million, San Bernardino and Alameda populations of 1.2 to 1.5 million, and San Joaquin a population of 0.5 million. San Joaquin had the highest percentage of the population on AFDC and the highest percentage of two-parent cases. In California, welfare reform outcomes also are being tracked in the rest of the state, and state staff members are working on methods for reweighting the research sample to make it more representative of the state as a whole (for example, in terms of ethnicity).
  3. Colorado chose 5 research counties from among 13 that applied to be in the demonstration, on the basis of their capacity to implement the demonstration and to represent the state's diverse geographic, economic, and demographic conditions.
  4. Michigan's sample of four offices was designed to be judgmentally representative along the dimensions of gender, race, age, earned income status, months since the case opened, and family size. Sampling rates were set so that the share of cases in Wayne County (Detroit) was the same as in the statewide caseload.
  5. Minnesota's MFIP is operating in 7 of the state's 87 counties, of which 3 are urban (including one Twin Cities county) and 4 are rural. The sample was designed to overrepresent rural cases in relation to the statewide caseload but to choose representative counties within the urban and rural groups. For the urban sample, the state wished to include the county containing St. Paul or that containing Minneapolis (Ramsey or Hennepin); it ruled out Ramsey because it was participating in another demonstration. The state also chose one of the two large suburban counties at random, but ended up including both because the second county offered to pay planning costs itself in order to be included. The rural counties (all remaining counties in the state) were originally to be part of a nonexperimental evaluation, and two clusters of counties were chosen randomly to represent rural counties statewide.(20) When the state moved to an experimental design for the rural counties as well, one of the two clusters was chosen for the demonstration.

In summary, the selection of the rural counties in Minnesota is the only example of sites being selected through a formal sampling procedure. All of the other states except Wisconsin chose sites that were, to some extent, judgmentally representative. Most evaluators also analyze site representativeness after sites have been chosen.

3. Analysis and Recommendations

In the waiver process, there has been relatively little emphasis on selecting a representative set of sites in the negotiations for approval of welfare reform waivers. This is in large part because the federal staff recognize the administrative burdens of implementing random assignment and understand that it may not be feasible for all local welfare offices to assume these burdens. However, the lack of requirements concerning site selection has made it possible for states to select a set of sites that are most likely to implement the reforms successfully and to use the results from these sites to generalize to the rest of the state. Full implementation of the reforms statewide may then produce less positive results. It sometimes makes sense to implement new and untested programs in sites that are most likely to be successful, to determine if the approach is feasible under the best of circumstances. However, it is most useful to research and policy development if such motivations are stated explicitly and if evaluators and policymakers are then appropriately cautious in generalizing the results.

In addition, there may be a trade-off between a state's short-run interest in implementing the evaluation in sites where implementation is relatively easy and the state's longer-term interest. For instance, states that pay little attention to the representativeness of demonstration sites initially may become very concerned about this issue if impact estimates run counter to their expectations. DHHS can play a useful role in encouraging states to take a longer-term perspective on evaluation design, one that encompasses both implementation concerns and the potential ramifications of an unrepresentative group of sites.

We recommend that states, in their sampling plans, spell out the criteria used in selecting sites (including whether the goal is approximate representativeness or selecting exemplary sites). We also recommend that states assess how representative the selected sites and their caseloads are of the state and its caseload as a whole. Wherever feasible, we recommend explicit sampling of sites, with some accommodation of administrative concerns (for example, very small sites or sites with special administrative difficulties could be excluded). Finally, we recommend that an analysis of the representativeness of the population in the sites (including both an updated analysis of site characteristics and a comparison of outcomes) be conducted after the fact, as well as during the site selection process. For example, in an intervention that is implemented statewide, but with random assignment in only selected demonstration counties, it may be useful to assess how similar the outcomes for the experimental group in the demonstration counties are to those in the state as a whole.

Notes

(1) The variance of the impact estimate is inversely related to T(1-T), where T is the proportion of the sample in the treatment group. All else equal, this variance is smallest when T = .5.

(2) In fact, in a comparison site design with two sites, it is impossible to distinguish the effects of the program from the effects of other site-specific factors that do not vary within the site. If a characteristic varies among individuals in a site, that variation can be used to identify its separate effect. Similarly, the larger the number of sites, the more ability there is to sort out other site effects from the effects of the intervention.

(3) Again, such unobserved factors must vary among individuals as well as across sites. Such models also require that a variable exists that predicts participation in the demonstration well (in a comparison site design, this may be equivalent to predicting where people live), but does not otherwise affect the program impact (known technically as an "identifying" variable). Otherwise, if the control variables in both the participation and outcome equations are the same, the predicted value of the participation variable will be either perfectly or highly correlated with the other variables in the outcome equation (since it is a function of those variables). The models also typically make restrictive assumptions about the distribution of the error terms. In referring to the "best of circumstances," we mean that lack of precision remains a concern in these models even when good identifying variables are available and the distributional assumptions are reasonable. Nonexperimental models are discussed further in Chapter VI.

(4) For example, if the federal requirement is for 2,000 approved control group applicants, but only 2 out of 3 applicants are approved, a sample of 3,000 control group applicants may be needed to satisfy the requirement. In an intervention that does not affect eligibility, a similar requirement implies a sample of only 2,000.

(5) The effect size is a way to standardize analysis of statistical power over different types of outcomes measured on different scales. A sample is selected to achieve a certain effect size (for example, to measure an impact equal to 10 percent of the standard deviation of the outcome) with, say, 80 percent power at a 5 percent significance level. The same sample size would be needed to reach a given effect size, regardless of the outcome measure.

(6)MAXIMUS, "Evaluation of the Work Not Welfare Demonstration: Evaluation Plan," pp. III-21-22 and Exhibit III-6.

(7)California Department of Social Services, "APDP Approval Case Sampling Plan," Attachment 1.

(8)Services offered to experimental cases are the same as those offered to most cases; some counties did not even tell caseworkers which cases were in the experimental group.

(9)Colorado Department of Social Services, "Sampling Plan to Implement the Colorado Personal Responsibility and Employment Program," p. 3.

(10)Methods for analysis of recidivism and similar outcomes are discussed in Chapter VI.

(11)MDRC, "Proposed Design and Work Plan for Evaluating the Minnesota Family Investment Program," p. 14.

(12)Technically, the design effect is the ratio of the sampling variance of an estimate from a complex sample design (for example, a design with oversampling of particular strata) to the sampling variance of an estimate from a simple random sample of the same size; its square root gives the corresponding inflation in standard errors.

(13)Wisconsin sampled the full caseload in both demonstration counties and also may include the full caseload in the comparison counties.

(14)The basic sample in Minnesota is a proportional sample of recipients and applicants; the evaluation also oversampled new applicants (defined only as applicants who had not been on AFDC for at least three years). The sampling rates in the urban counties were 13 percent for single-parent recipients, 80 to 86 percent for single-parent applicants, and 46 to 53 percent for two-parent applicants and recipients (Knox et al. 1995).

(15)The MFIP evaluation is the exception; it reports an intake of new applicants and reapplicants higher than expected. It is not clear if this indicates an effect of the demonstration or other factors. MDRC responded by cutting the sampling rates or the intake periods for several subgroups.

(16)Bloom (1995) provides a formula for making this adjustment, based on the R-squared expected for the regression equation.
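As a rough indication of the form such an adjustment takes (a stylized sketch; Bloom 1995 should be consulted for the exact expression): if the regression is expected to explain a share R^2 of the variance of the outcome, the residual variance falls from \sigma^2 to \sigma^2(1 - R^2), so the sample needed to detect a given impact falls by approximately the factor (1 - R^2), and the minimum detectable effect for a fixed sample falls by approximately \sqrt{1 - R^2}.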

(17)Strictly speaking, active cases in the research sample should be representative of the full active caseload in the state. However, DHHS generally has been willing to assume the sampled sites are representative of the caseload (see Section C).

(18)Weighting schemes become more complicated if sampling rates are changed over time, perhaps because sample intake has been lower than expected, or if weights must also be used for some other purpose (such as adjusting for oversampling of sites or subgroups).

(19)The implications of entry effects for the analysis are discussed in Chapter VI.

(20)The Upjohn Institute, as a consultant to the state, developed a model to select clusters of counties with approximately 1,500 cases each. It used 49 variables to describe each county and selected clusters with the goals of maximizing generalizability to all rural counties in the state and of having pairs of clusters that were well matched. In addition, it restricted the model so that no cluster could contain more than one county that was not interested in participating.

Chapter 4: Implementation of Experimental Evaluations

Even if the design of a welfare reform evaluation is fundamentally sound, the implementation of the evaluation is critical for its overall success. The successful implementation of an experimental welfare reform evaluation generally will include the following features:

  • Random assignment will occur at the point in time that leads to the desired population of cases being represented in the research sample.
  • Random assignment will be performed in a manner that is not subject to manipulation by clients or caseworkers.
  • The control group policies will represent the policies that would have been in place in the absence of welfare reform.
  • Once random assignment has occurred, experimental and control group cases will continue to receive, for as long as the evaluation is in effect, the policies to which they were assigned originally.

Failure to achieve these goals when implementing an experimental welfare reform evaluation may lead to inaccurate estimates of the impacts of welfare reform policies.

This chapter addresses four issues concerning the implementation of experimental welfare reform evaluations:

  1. When should random assignment be performed?
  2. How should random assignment be performed?
  3. Once random assignment has occurred, what steps can be taken to ensure that the experimental policies represent genuine welfare reform policies and that the control policies remain the same as the policies that would have been in place in the absence of welfare reform?
  4. Once random assignment has occurred, what steps can be taken to ensure that experimental cases continue to receive experimental policies and that control cases continue to receive control policies?

A. TIMING OF RANDOM ASSIGNMENT

An important issue in implementing an experimental evaluation design is determining when to perform random assignment of cases to the experimental and control groups. The timing of random assignment does not necessarily bias the impact estimates obtained through an experimental welfare reform evaluation. It does, however, determine the population of cases for which those estimates are applicable.

1. Issues

This section focuses on two questions:

  1. When should random assignment occur for recipient cases?
  2. When should random assignment occur for applicant cases?
a. Timing of Random Assignment for Recipient Cases

For recipient cases, random assignment can occur either at a single point in time (when the welfare reform policies are first introduced in the research sites) or over time (as ongoing cases go through the redetermination process).

The chief advantage of random assignment at the time of redetermination is that, immediately following random assignment, cases can be informed in person of their experimental or control status. Providing this information in person helps ensure that recipients understand the programs and rules that apply to them and avoids confusion about the policies to which they are subject.

The chief disadvantage of sampling recipients at the time of redetermination is that this approach excludes cases that do not remain on assistance long enough to reach this point. For example, when assignment occurs at redetermination, the recipient sample may include no cases that have been on assistance for less than six months.

b. Timing of Random Assignment for Applicant Cases

When randomly assigning applicants to treatment or control status, the goal is to include all applicants who could be eligible for assistance under either experimental or control policies, but to exclude applicants who are twice-ineligible--ineligible under both sets of rules. The timing of random assignment before or after eligibility determination may crucially affect whether the sample meets this goal.

If welfare reform does not change eligibility rules at all, then eligibility determination should precede random assignment. Then, only approved applicants are kept in the sample.

If eligibility rules change, there is no ideal solution, unless it is feasible to determine eligibility under both sets of rules. When random assignment of applicants occurs before the determination of eligibility for benefits, then the sample of applicants will likely include twice-ineligible cases. These cases probably will not be affected by welfare reform programs, unless they later become eligible and reapply for welfare benefits. Ignoring the possibility of future changes in eligibility, it is reasonable to assume that the impact of welfare reform on twice-ineligible cases is zero. Under this assumption, the estimated impact of welfare reform on the entire sample of applicants will be smaller than the estimated impact of welfare reform on the sample of eligible applicants--those eligible under at least one set of rules. The extent to which impact estimates are diluted for applicants will depend on how large a portion of the sample of applicants is twice-ineligible.(1)
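To make the dilution concrete (a stylized approximation that assumes the impact on twice-ineligible cases is exactly zero), let p denote the share of the applicant sample that is twice-ineligible; then

    \hat{\Delta}_{\text{all applicants}} \;\approx\; (1 - p)\,\hat{\Delta}_{\text{eligible applicants}}.

For example, if one-quarter of applicants are twice-ineligible, the estimated impact for the full applicant sample will be roughly three-quarters of the estimated impact for eligible applicants.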

When eligibility rules change and random assignment of applicants occurs after the determination of eligibility for benefits, then the sample of applicants will be restricted to cases eligible for assistance under the rules used to determine eligibility prior to random assignment. The sample will contain all applicants eligible for assistance under either set of policies only under the following circumstances:

  • If one set of eligibility rules is strictly broader than the other set of eligibility rules, and the first set of rules is used for the initial eligibility determination, with a second set of rules applied shortly thereafter to cases subject to the narrower set of rules
  • If eligibility under both sets of rules is determined for each case at the time of random assignment

The first example is of a situation in which one eligibility calculation occurs prior to random assignment and another shortly thereafter, and the second is of a situation in which a dual eligibility calculation occurs prior to random assignment.

States generally will find it most convenient to perform a single eligibility calculation for each case, with no subsequent eligibility calculation occurring until the time of redetermination. Sequential or dual eligibility calculations usually will impose greater administrative burdens on states. For example, a state may prefer to have separate staff administer experimental and control policies to avoid confusion of case status or corruption of the random-assignment process. Performing a sequential or dual eligibility determination while maintaining separate staff could effectively double the staff time needed to determine a case's eligibility for welfare benefits.

Sequential eligibility determinations also may create awkward situations for states. For example, if welfare reform expands eligibility, and initial eligibility is determined under this broader set of rules, the state will need to recalculate eligibility and benefits under the narrower set of rules for cases subsequently assigned to the control group. This second eligibility calculation may result in benefits being lowered or eliminated entirely for cases in the control group. Specifically, to ensure that control cases receive only control group policies following random assignment, benefits would need to be reduced retroactively to account for the broader rules initially applied.

It might be feasible to perform dual eligibility calculations if eligibility were determined by computer. This could be done in five steps (a schematic sketch in code follows the list):

  1. The applicant provides the information needed to determine eligibility under either welfare reform rules or control group rules.
  2. The applicant's eligibility information is entered into a computer.
  3. The computer determines welfare eligibility and benefits under both sets of rules.
  4. If the applicant is eligible for assistance under either set of rules, the computer randomly assigns the case to either the treatment group or the control group.
  5. The computer informs the applicant of the benefits (if any) for which the applicant is eligible and directs the applicant to a caseworker trained to administer the underlying set of policies to which the applicant has been assigned.
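The following sketch illustrates how such a computerized process might operate. It is schematic only: the functions eligible_under_reform and eligible_under_control are hypothetical placeholders for the state's two benefit calculations, and the 50/50 assignment probability is an arbitrary example rather than a feature of any evaluation reviewed here.

    import random

    def process_applicant(applicant_info):
        # Steps 1 and 2: the applicant's information has been collected and entered.
        # Step 3: determine eligibility and benefits under both sets of rules.
        # (eligible_under_reform and eligible_under_control are hypothetical
        # placeholders for the state's two benefit calculations.)
        reform_eligible, reform_benefit = eligible_under_reform(applicant_info)
        control_eligible, control_benefit = eligible_under_control(applicant_info)

        # Twice-ineligible applicants are excluded from the research sample.
        if not (reform_eligible or control_eligible):
            return "excluded (twice-ineligible)", 0

        # Step 4: randomly assign cases eligible under either set of rules
        # (a 50/50 split is shown only as an example).
        group = "experimental" if random.random() < 0.5 else "control"

        # Step 5: report the benefits (if any) under the assigned group's rules.
        benefit = reform_benefit if group == "experimental" else control_benefit
        return group, benefit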

2. State Approaches

In this section, we describe the timing of random assignment in the four evaluations featuring an experimental design. We consider random assignment of recipient and applicant cases separately.

a. Timing of Random Assignment for Recipients

Only Minnesota's evaluation implemented random assignment of recipients at the time of redetermination; the other evaluations implemented the random assignment of recipients at a single point in time. In California's evaluation, random assignment of recipients occurred about one month before the implementation of welfare reform throughout the state, with oversampling to allow for errors and attrition; recipients who did not receive welfare benefits during the month of implementation were deleted from the sample. In Colorado's evaluation, welfare reform was not implemented throughout the state, so recipients assigned to the experimental group were the only group subject to welfare reform policies. In Michigan's evaluation, random assignment of recipients occurred at the same time as the implementation of welfare reform throughout the state.

b. Timing of Random Assignment for Applicants

Two of the evaluations implemented random assignment of applicants before eligibility determination, while the other two implemented random assignment after eligibility determination. Both Michigan's and Minnesota's evaluations assigned applicants to experimental and control groups before determining eligibility. In Minnesota's evaluation, random assignment was immediately preceded by baseline data collection; it was immediately followed by eligibility determination (under the rules for the appropriate group) and then separate group orientation sessions for the experimental and control groups. Minnesota's research sample included denied applicants. Denied applicants who had gone through random assignment were excluded from Michigan's sample because of data limitations. Fortunately, Michigan's welfare reform and control group policies differed little in eligibility rules, so the comparability of the experimental and control groups probably was not seriously compromised by the exclusion of denied applicants from the sample.

In California's evaluation, random assignment of applicants occurred after intake and eligibility determination (under experimental group rules, which were identical to the control group rules for applicants). Originally, random assignment of applicants was to occur at intake, but this "proved to be too expensive and disruptive to county operations."(2) Control group members were subject to experimental policies for one or two months. A control case whose initial AFDC benefits were too low would receive a retroactive supplement. A control case whose initial AFDC benefits were too high (the rarer case) would not receive a retroactive reduction. Short-term AFDC recipients in the control group may have left AFDC without being exposed to control group policies. Control cases receiving AFDC-UP who began working over 100 hours after enrollment (but before random assignment) were dropped from the research sample rather than disenrolled from AFDC-UP. To preserve comparability of the experimental and control group samples, the corresponding AFDC-UP cases in the experimental group should also have been dropped from the research sample, but this action does not appear to have been taken.

In Colorado's evaluation, random assignment of applicants occurred one month after cases were observed in the state's data system. Only applicant cases approved for AFDC under the (slightly more restrictive) control group policies were included in the research sample. Experimental cases in the sample of approved applicants were subject to control group policies for one month before the application of experimental group policies.

None of the welfare evaluations we reviewed performed dual eligibility calculations. As states upgrade the computer systems used to administer their assistance programs, this sort of random-assignment process may become more feasible in the future.

3. Analysis and Recommendations

We have reviewed the timing of random assignment in four evaluations featuring an experimental design. Here, we offer separate recommendations for the random assignment of recipients and the random assignment of applicants.

a. Timing of Random Assignment for Recipients

Only Minnesota's evaluation performed random assignment of recipients at the time of redetermination; the other evaluations assigned recipients to experimental and control groups at a single point in time. The latter approach has the disadvantage that it introduces a potential lag between random assignment and a case becoming aware of its experimental or control status. When recipients are all assigned at once, their first knowledge of their experimental or control status will probably be by letter, with an in-person explanation being perhaps weeks or months away. In contrast, random assignment at the time of redetermination allows recipients to be told in person about the policies to which they are subject and to be reminded that other cases in the same region may be subject to a different set of policies. For example, in Minnesota, group orientation sessions were held for recipients who had just gone through random assignment to explain the policies to which they were subject.

Because it is desirable that cases be informed of their experimental or control status immediately following random assignment, we recommend that states consider the option of performing random assignment of recipients at the time of redetermination. However, states should be aware that if they perform random assignment at redetermination, the sample of recipients will necessarily exclude cases that leave welfare between the time of implementation of welfare reform and their next time of redetermination.

If recipients are assigned to experimental or control groups over time, rather than all at once, the implementation of welfare reform in the research sites will necessarily be gradual. The gradual phasing in of welfare reform in the research sites may conflict with a state's desire for a dramatic implementation of welfare reform throughout the state. If a state prefers to implement welfare reform policies all at once in the research sites, it would be helpful for the state to inform recipients in advance that they may be assigned to one of two welfare programs--the welfare reform program or the preexisting welfare program. Recipients should not assume that they will necessarily be subject to welfare reform provisions. Upon going through random assignment, recipients would be notified of the particular policies that apply to them.

b. Timing of Random Assignment for Applicants

We recommend that, if welfare reform substantially changes the rules by which eligibility is determined, states perform dual eligibility calculations before random assignment of applicants. They should include in the sample of applicants all cases eligible for welfare under either experimental or control group rules but exclude cases eligible for welfare under neither set of rules. Dual eligibility calculations are particularly justified if twice-ineligible cases represent a large proportion of all applicants. Including twice-ineligible cases in the sample of applicants increases the number of observations for which the likely impact of welfare reform is zero and may make it more difficult to identify statistically significant impacts for eligible applicants. When eligibility rules are the same for the experimental and control groups, only a single eligibility calculation is necessary. All ineligible cases are excluded from the sample of applicants.

None of the random-assignment evaluations we reviewed included dual eligibility calculations before random assignment. Only one of these evaluations was of a waiver package in which the experimental and control groups faced identical rules for welfare eligibility, although in two other instances eligibility rules were only slightly different for the experimental and control groups.

We recognize that, for ease of administration, states may be prepared to perform only a single eligibility calculation, either before or after random assignment. When welfare reform does not change welfare eligibility rules, we recommend that random assignment of applicants occur after eligibility has been determined. When welfare reform changes eligibility rules, we recommend that random assignment of applicants occur before the determination of welfare eligibility and benefits. If a state follows this pattern, there will be no discrepancy between the initial determination of a case's welfare eligibility and benefits and its assigned experimental or control status. Because random assignment before eligibility determination is likely to introduce twice-ineligible cases into the sample of applicants, we recommend that evaluators track all applicant cases (approved and denied) and report cumulative denial rates for applicants in the experimental and control groups. For the group subject to the broader eligibility rules, the denial rate may provide an approximate measure of the share of applicant cases for which the likely impact of welfare reform is zero.

B. METHOD OF RANDOM ASSIGNMENT

Another important issue in implementing an experimental evaluation design is determining how to perform random assignment of cases to the experimental and control groups. If an inappropriate method is chosen for performing random assignment, the experimental and control groups may not be truly comparable at baseline, and inferences about subsequent impacts from welfare reform may be biased.

1. Issues

This section focuses on two aspects of performing random assignment:

  1. Selecting a method of random assignment
  2. Assessing the performance of a random-assignment method
a. Selecting a Method of Random Assignment

Random assignment should generate no systematic differences between the baseline characteristics of the experimental group and the baseline characteristics of the control group. The simplest way to accomplish this is to assign cases to the experimental or control group with probabilities equal to the desired proportion of cases in each group. For example, if the experimental group is expected to equal 35 percent of the cases passing through random assignment, each case passing through random assignment should have a 35 percent chance of selection into the experimental group. Neither the cases themselves nor the program staff administering the program should have any say over who is selected for the experimental group or who is selected for the control group. Otherwise, certain cases may be favored to receive particular policies, leading to systematic baseline differences between experimental cases and control cases.

Random assignment could be accomplished in a manner consistent with these principles in several ways. One approach would be to have a computer generate, for each case, a random number from a uniform distribution between 0 and 1. Suppose, for example, that the state wished to assign 35 percent of cases to the experimental group, 35 percent to the control group, and 30 percent to a nonresearch sample. The computer would compare the random number with the case's probability of selection into the experimental group (0.35 when the probability of selection equals 35 percent). If the number were less than or equal to the probability of selection, the case would be assigned to the experimental group. If the number were greater than the probability of selection, but less than or equal to the probability of being selected into either the experimental group or the control group (0.70 when the probability of selection equals 35 percent for each group), then the case would be assigned to the control group. Otherwise, the case would be assigned to the nonresearch sample.
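A minimal sketch of this procedure, using the illustrative 35/35/30 split described above (the selection probabilities are parameters a state would set):

    import random

    def assign_group(p_experimental=0.35, p_control=0.35):
        r = random.random()                      # uniform draw between 0 and 1
        if r <= p_experimental:
            return "experimental"
        elif r <= p_experimental + p_control:
            return "control"
        return "nonresearch"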

Another method of random assignment would be to assign cases by the social security number (SSN) of the case head. Since SSNs are not entirely random, only digits at or near the end of the number should be used for random assignment. For example, when the probabilities of selection to the experimental group and the control group are each 35 percent, random assignment could be on the basis of the last two digits of the case head's SSN (00 to 34 resulting in assignment to the experimental group, 35 to 69 resulting in assignment to the control group, and 70 to 99 resulting in assignment to the nonresearch sample). This approach carries a slight risk that, if potential applicants are informed of an SSN-based selection rule in advance, their decision about whether to apply for welfare or whom to identify as the case head might be affected, thereby corrupting the process of random assignment.
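A corresponding sketch for assignment by the last two digits of the case head's SSN, again using the 35/35/30 example (the SSN is assumed to be supplied as a string of digits):

    def assign_by_ssn(ssn):
        last_two = int(ssn[-2:])                 # last two digits, 00 through 99
        if last_two <= 34:
            return "experimental"
        elif last_two <= 69:
            return "control"
        return "nonresearch"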

A third way to accomplish random assignment is to assign cases on the basis of some other number used for administrative purposes, such as a case number. It would be important to determine the manner in which this number is being generated, and in particular whether program officials have any control over its value. Assuming that particular digits were not subject to the control of program administrators, but were indeed generated randomly, the process of random assignment could proceed in a manner similar to random assignment using the case head's SSN. It would be important to ensure that recipient cases not become aware of the selection rule in advance; otherwise, certain cases might decide to leave welfare prior to random assignment, thereby corrupting the selection process.

b. Assessing the Performance of a Random-Assignment Method

An evaluator can detect problems with the method of random assignment in two ways. First, as part of an implementation study, the evaluator can conduct interviews with program staff in the research sites. These interviews can shed more light on how random assignment was accomplished in practice, as well as whether caseworkers or prospective clients had any opportunity to influence (either intentionally or unintentionally) the odds of a case being assigned to the experimental group instead of the control group.

A second method of assessing the method of random assignment is to compare the baseline characteristics of the experimental and control groups as a preface to an impact study. These comparisons would include statistical tests of differences in mean characteristics between experimental and control cases. On average, there should be few statistically significant differences between the baseline characteristics of the experimental group and those of the control group. The occasional detection of a statistically significant difference between experimental and control cases does not prove that random assignment was flawed. However, the more statistically significant differences that are detected between experimental and control cases at baseline, and the larger these differences, the more likely it is that the assignment of cases to the respective categories was not entirely random.(3)
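As an illustration, baseline comparisons of this kind could be carried out as sketched below (the sketch assumes the baseline data are held in a pandas DataFrame with a column identifying each case's research group; the variable names are hypothetical):

    from scipy.stats import ttest_ind

    def baseline_differences(baseline, characteristics):
        # `baseline` is assumed to be a pandas DataFrame with a "group" column
        # taking the values "experimental" or "control"; `characteristics` is a
        # list of numeric baseline variables (names are illustrative only).
        results = {}
        for var in characteristics:
            exp = baseline.loc[baseline["group"] == "experimental", var].dropna()
            ctl = baseline.loc[baseline["group"] == "control", var].dropna()
            t_stat, p_value = ttest_ind(exp, ctl)
            results[var] = (exp.mean() - ctl.mean(), p_value)
        return results   # experimental-control difference and p-value by variable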

If both interviews with program staff and comparisons of experimental and control cases uncover irregularities in the process of random assignment, then there is a good chance that the process of random assignment was flawed. Otherwise, the assumption that the experimental and control groups are comparable to each other generally can be maintained, and experimental-control differences in subsequent outcomes can be attributed to differential exposure to welfare reform policies.

2. State Approaches

In this section, we describe the method of random assignment in the four evaluations that feature an experimental design.

a. Selecting a Method of Random Assignment

The four random-assignment evaluations used different methods of random assignment. Both Colorado's and Minnesota's evaluations performed random assignment using a random number generated by a computer.

California's evaluation assigned cases by sorting them by case number and then using a random start and interval sampling to determine membership in the experimental and control groups. Because case numbers are assigned sequentially in each county, this method ensured that the experimental and control samples of recipient cases had exactly the same proportion of cases of different welfare tenures. If another method of random assignment had been used, there would be no guarantee that the tenure distribution of cases would be identical for the experimental and control groups, although on average the distribution of tenures would be the same for both groups.
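A minimal sketch of interval sampling with a random start (schematic only; in practice the interval would reflect the desired sampling rate, and the assignment of selected cases to the experimental and control groups would follow the evaluation's sampling plan):

    import random

    def interval_sample(sorted_case_numbers, interval):
        start = random.randint(0, interval - 1)  # random starting point
        return [sorted_case_numbers[i]
                for i in range(start, len(sorted_case_numbers), interval)]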

Michigan's evaluation assigned recipient cases using the eligibility worker's caseload identification number rather than the case number. As a result, recipient cases were assigned experimental, control, or nonresearch status in groups rather than as individual cases. Caseworkers then specialized in administering the welfare rules to which their clients had been assigned. The main advantage of this method was that it ensured that recipients, and in particular recipients in the control group, did not experience the disruption of having to change caseworkers because of the implementation of the evaluation. Random assignment of applicant cases was performed using the last two digits of the head's SSN.

b. Assessing the Performance of a Random-Assignment Method

As required by the terms and conditions of the Section 1115 welfare waivers, all of the evaluators assessed the method of random assignment both through interviews with program staff and through statistical comparisons of the baseline characteristics of experimental and control cases. Neither Colorado's nor Minnesota's evaluation reported any concerns about the implementation of random assignment. California's evaluation experienced some problems in sampling applicants; originally, individuals (rather than cases) were sampled, and some previous recipients were sampled as "new" applicants. In Michigan's evaluation, while some statistically significant differences were detected between the baseline characteristics of the experimental group and the baseline characteristics of the control group, none of these differences exceeded two percentage points. Possible explanations for these differences include mere chance, the random assignment of recipients in groups rather than as individual cases, and the exclusion of denied applicants from the analysis sample.

3. Analysis and Recommendations

We have reviewed the method of random assignment in four evaluations featuring an experimental design. Here, we present recommendations on two issues: (1) selection of a method of random assignment, and (2) assessment of the performance of the method of random assignment.

a. Selecting a Method of Random Assignment

The four evaluations we examined differed in the method they used to perform random assignment. California's evaluation used an interval sampling approach relying on sequential case numbers with a random start. Colorado's and Minnesota's evaluations used computer-generated random numbers to perform random assignment. Michigan's evaluation used the eligibility worker's number to perform random assignment of recipients and the case head's SSN to perform random assignment of applicants.

All of these approaches appear to have been acceptable means of performing random assignment, but each has advantages and disadvantages. Use of a computer-generated random number best ensures that cases will not learn of their experimental or control status before random assignment, even when cases go through random assignment over a period of time. The main disadvantage of this approach is that program administrators may lack the resources to generate a random number for each case at the time of random assignment.

Interval sampling has the advantage of being a simple and straightforward approach. If case numbers are assigned sequentially, interval sampling will produce experimental and control groups in which cases of different welfare tenures are guaranteed to be represented in the same proportions in both groups. The disadvantage of this approach is that steps must be taken to ensure that the interval sampling begins with a random starting point and that the subsequent assignment of recipient and applicant cases to the list used in sampling is immune from manipulation by state employees or the cases themselves. An example of such manipulation would be program staff altering the sequence in which applicants are entered on the list.

Random assignment of recipients by eligibility worker number has the advantage of preserving the relationship between caseworkers and ongoing cases. This relationship is more likely to be disrupted by random-assignment approaches that assign recipients individually, since cases may then be moved to new eligibility workers or even to new welfare offices. The main disadvantage of random assignment by eligibility worker number is that cases are assigned experimental and control status in groups rather than on an individual basis, thereby increasing the possibility of large differences in the baseline characteristics of the experimental and control group samples.

Random assignment by the last two digits of the case head's SSN has the advantage of employing a number readily available from a state's administrative records. This approach introduces a slight risk of giving advance notice to recipients and potential applicants of the rules by which experimental or control status is determined. Anticipating their experimental or control status, recipients may decide to leave welfare before random assignment, and potential applicants may decide either not to apply for assistance or to apply with another individual identified as the case head.

b. Assessing the Performance of a Random-Assignment Method

In all the experimental evaluations reviewed, an implementation study documented program administrators' perspectives on the process of random assignment, and the impact study compared the baseline characteristics of cases in the experimental and control groups. We believe that both of these investigations are valuable in uncovering any irregularities in the method of selecting the experimental and control groups. We recommend that evaluators continue to use both approaches for monitoring the implementation of random-assignment evaluations.

C. ENSURING THAT WELFARE REFORM DOES NOT CHANGE CONTROL POLICIES

Another important issue in implementing a welfare reform evaluation is ensuring that the members of the control group are subject to the welfare policies that would have been in place in the absence of the welfare reform program. Even if random assignment proceeds without error, the implementation of welfare reform may alter control group policies:

  • Services provided to members of the experimental group and other cases subject to welfare reform policies may spill over to cases in the control group.
  • Services provided to members of the experimental group and other cases subject to welfare reform policies may displace services that, in the absence of welfare reform, would have been provided to cases in the control group.

When control group policies are changed, the resulting measures of the impacts of the welfare reform program will be biased, because the control group is subject to policies and situations qualitatively different from those that would have existed in the absence of welfare reform. In spillover, impact estimates may be too small, since control group members receive some welfare reform policies. In displacement, impact estimates may be too large, since control group members fail to receive services they would have received in the absence of the welfare reform.

1. Issues

This section discusses two issues related to displacement, spillover, and other changes in control group policies:

  1. How can the spillover of welfare reform policies to the control group be avoided?
  2. How can displacement of control group policies by welfare reform policies be avoided?
a. Avoiding the Spillover of Welfare Reform Policies

In situations in which spillover occurs, control group members receive services they would not have received in the absence of welfare reform. A prime example of spillover is the presence of community effects, in which welfare reform affects members of the control group through changes in community institutions or norms. For example, if welfare reform is accompanied by strong community expectations that welfare recipients work, then control group members may be subject to additional social pressure to obtain employment, even if the formal policies to which they are subject have not changed.

The preceding example suggests that a certain degree of spillover is likely, since a welfare reform program usually will be accompanied by changing attitudes in the community. Indeed, these changes in attitudes and expectations are often both the cause and the intended consequence of a major welfare reform initiative. The resulting contamination of control group policies can be reduced by reminding members of the control group of the policies that apply to them, as well as the policies that do not apply to them (even if neighbors, friends, or relatives are subject to welfare reform provisions). Cases can be given these reminders through mailings, group orientation sessions, or regular meetings with the caseworker. The most effective approach probably is meetings with the caseworker, since they occur on a regular basis and provide information in person.

Spillover also can occur as the consequence of administrative error or manipulation, in which program staff administer welfare reform policies to individuals who are still officially in the control group. For example, a welfare reform-related instructional video could be shown to all research cases, because program staff fail to distinguish the experimental and control groups. This type of spillover can be reduced in several ways. To the extent possible, separate staff can administer welfare reform policies and control group policies. In addition, files for experimental and control cases can be clearly distinguished through the use of different colored folders or other measures, so welfare reform policies are never accidentally applied to control group members.

b. Avoiding Displacement of Control Group Policies

Displacement of control group policies is the opposite of spillover, since it involves control group members failing to receive services they would have received in the absence of welfare reform. An example of displacement would be reductions in job-training services to control group members because of longer waiting lists arising from welfare reform. Usually, any displacement is the unintentional consequence of administering two welfare systems at the same research site. On rare occasions, however, displacement may arise through the intentional actions of program administrators, who may devote greater attention to ensuring that welfare reform "works" and less attention to maintaining the "old" welfare system.

As with spillover, a certain degree of displacement is likely to occur in an experimental evaluation, since the administration of two programs in the same research site may reduce the resources available for administering services to members of the control group. Nonetheless, the risk of displacement may be reduced by seeking to preserve enough administrative resources for the continuation of control group policies and by maintaining separate staff to administer experimental and control group policies to avoid any manipulation by program administrators.

2. State Approaches

In this section, we consider state approaches to the challenges of minimizing the risk of spillover and minimizing the risk of displacement.

a. Minimizing Spillover

The spillover of welfare reform services to the control group was difficult to measure in the evaluations we reviewed. In Colorado's evaluation, welfare reform was implemented only in the demonstration counties; before implementation, there was concern that the provision of JOBS services to control cases would increase because welfare reform-related JOBS expansions would spill over to control cases in the research counties. Subsequent to implementation, the evaluator has been concerned that political pressures may be resulting in the displacement of services to control cases. The evaluators of Michigan's initiative suspected that some spillover had occurred because of the welfare reform program's strong emphasis on preparing individuals for employment, but they were unable to quantify these effects. (Presumably, the amount of spillover in Michigan's demonstration was reduced because caseworkers were assigned to serve only either experimental or control cases.) In Wisconsin, changing community institutions was a major goal of the WNW initiative; the likelihood of spillover to a control group through community effects was a major justification for the adoption of a nonexperimental design for the evaluation.

b. Minimizing Displacement

The evaluations studied differed greatly in the steps taken to reduce the risk of displacement. As noted earlier, Michigan's evaluation assigned recipients on the basis of eligibility worker numbers, thereby preserving relationships between caseworkers and ongoing cases in the control group. In contrast, in California's evaluation, recipients in the control group were frequently assigned to different caseworkers and sometimes to different welfare offices, since the control group was a very small proportion of the county caseload and welfare reform policies were implemented for all other cases. The specialized control group caseworkers, fewer in number than the caseworkers administering welfare reform policies, did not always know the language of their clients and were sometimes located far away from them. The implementation of welfare reform effectively displaced the relationships that had developed between caseworkers and clients. In addition, as a result of being assigned to new caseworkers, control cases were more likely to be "cleaned" (have eligibility reexamined) than experimental cases continuing with the same caseworkers.

3. Analysis and Recommendations

Both spillover and displacement are threats to a welfare reform evaluation, because they change the policies received by members of the control group and bias estimates of the impacts from welfare reform. It is usually difficult or impossible to adjust impact estimates for this bias, so states implementing welfare reform evaluations should take steps to minimize spillover and displacement.

Certain measures can help states reduce the risk of the spillover of experimental policies to the control group and the displacement of control group policies as a consequence of welfare reform. In particular, we recommend that states keep experimental and control group members well informed, both in writing and in person, of the policies that apply to them. In this way, states can counteract any spillover of attitudes and impressions from the implementation of welfare reform. We also recommend that states administer experimental and control policies using separate but equivalent staff, with minimal disruption and displacement of the services already being provided to recipients in the control group.

Evaluators can gather evidence on the extent to which spillover or displacement has occurred through a study of the implementation of a welfare reform evaluation. Such a study could include interviews with program administrators in which these individuals are asked to provide the following information:

  • How cases are informed of the policies that apply to them
  • Whether program staff members process experimental and control cases differently
  • How program staff members distinguish experimental and control cases
  • Whether program staff members ever confuse experimental and control cases, applying one group's policies to the other group's members
  • The extent to which separate staff members handle experimental and control cases
  • Whether welfare reform has expanded or narrowed the opportunities available to cases in the control group
  • Whether welfare reform has otherwise affected the way in which policies are administered to cases in the control group

Evaluators can also gain useful insights through interviews with participants about their perceptions of the policies that apply to them and the services they received. The process evaluation of APDP/WPDP in California is one example of a study that carefully considers these issues. This information will not solve the problems of spillover or displacement, but it will provide evidence of the extent to which these problems are present in an evaluation.

D. ENSURING THAT CASES' EXPERIMENTAL/CONTROL STATUS DOES NOT CHANGE

Even if control group policies consistently represent the policies that would have existed in the absence of welfare reform, individuals originally belonging to the control group may be exposed to experimental policies (or vice versa) in some situations:

  • A case may relocate to a site in which it receives a different set of policies
  • A case may merge with a case of a different experimental/control status
  • A case may split off from another case and be assigned a different status
  • A case's official experimental/control status may be changed as the result of administrative error or manipulation(4)

All of these situations are examples of cases crossing over from one experimental/control status to another. We distinguish three types: migrant crossover cases, which experience a change in experimental/control status because of migration; merge/split crossover cases, which experience a change in experimental/control status because of a case merger or split; and administrative crossover cases, which experience a change in experimental/control status because of administrative error or manipulation. In general, crossover from control to experimental status is more likely when most of the welfare cases in the state are subject to welfare reform policies, while crossover from experimental to control status is more likely when most of the welfare cases in the state are not subject to welfare reform policies.

Regardless of how crossover occurs, the presence of crossover cases in the research sample may result in biased impact estimates because some cases receive the other group's policies instead of the policies to which they were originally assigned. In particular, impact estimates may be too small, since a fraction of original control cases becomes subject to welfare reform policies, and/or a fraction of original experimental cases becomes subject to control group policies. Statistical methods for adjusting for crossover exist (they are discussed in Chapter VI); however, these methods have certain theoretical and practical limitations, so it is in a state's interest to minimize crossover.

1. Waiver Standards for Minimizing Crossover of Cases

The terms and conditions of Section 1115 welfare waivers specify several steps designed to reduce the incidence of cases crossing over from control group policies to experimental group policies (or vice versa):

  • When a case relocates from one research site to another research site, the case's experimental/control status is to be preserved.
  • When a research case splits into multiple cases remaining in a research site, the original case's experimental/control status is to be preserved for the new cases.
  • When an experimental/control case merges with another case in the same research site, the experimental/control status of the head of the new case is to be preserved.

These standards do not address crossover that occurs through administrative errors in the classification of cases' experimental or control status.

2. State Approaches

Although the reported incidence of crossover is seldom high for welfare reform waiver evaluations, the evaluations we reviewed had some difficulties in minimizing the risk of crossover. Most research sites were not next to each other, because the desire to obtain a sample representative of the state took precedence over the desire to reduce migration to nonresearch sites. In the four states with experimental evaluation designs, the lack of contiguous research counties may have increased the risk of crossover through migration.

The absence of experimental/control status information in individual records in state administrative systems (as opposed to case records) often made crossover from splits and mergers more likely by failing to identify individuals with previous membership in a research case. In California's evaluation, for example, county-specific automated systems made it difficult for caseworkers to identify crossovers from other counties in the state; the evaluator was able to achieve this identification by relying on state Medicaid records noting receipt of AFDC during the previous 12 months. In Michigan's evaluation, the state was unable to identify the previous research status of individuals reapplying for assistance (although the evaluator later obtained this information by merging case- and individual-level files).

3. Analysis and Recommendations

Crossover is a potentially serious threat to a welfare reform evaluation, because it blurs the distinction between experimental and control cases and can lead to biased estimates of the impacts from welfare reform. The terms and conditions of Section 1115 welfare waivers have devoted considerable attention to minimizing the risk of crossover, and we recommend that states seek to adhere to the waiver standards for administering experimental or control policies to cases that migrate, merge, or split.

States can reduce the incidence of crossover in welfare reform evaluations in at least three additional ways:

  1. A large portion of the state's welfare population could be included in the research sample. This measure, while costly, would reduce the likelihood of crossover through migration to nonresearch sites. However, it could increase the risk of spillover, if control group cases were a smaller proportion of each worker's caseload.
  2. Research sites could be located in contiguous counties, since migration might be more likely to nearby counties than other counties. However, this could create trade-offs with goals of external validity, since it may be difficult to choose representative sites that are also contiguous.
  3. Statewide administrative records could append original experimental/control status information to individual records as well as case records. This would make it easier to identify, at the time of application or redetermination, individuals who have split off from research cases or who are merging with other cases and make it less likely that individuals' experimental/control status will be changed as the result of administrative error or manipulation.

Notes

(1)The share of applicants that is twice-ineligible can often be approximated by calculating the share of applicants that has always been denied welfare benefits. If welfare reform eligibility rules are strictly broader than control group eligibility rules, then the cumulative denial rate for experimental applicants is a good estimate of the fraction of applicants that is twice-ineligible. If control group eligibility rules are strictly broader than welfare reform eligibility rules, then the cumulative denial rate for applicants in the control group is a good estimate of the fraction of applicants that is twice-ineligible. If one set of eligibility rules is not strictly broader than another, then the fraction of applicants that is twice-ineligible cannot be determined merely from cumulative denial rates.

(2)"UC DATA, "Assistance Payments Demonstration Project: Process Evaluation: Phase I," p.8.

(3)Statistically significant differences between the baseline characteristics of experimental and control cases will be easier to detect when the size of the research sample is larger.

(4)We distinguish this change in status from the situation, discussed in the last section as an example of spillover, of a case retaining its official status but receiving the other group's policies because of administrative error or manipulation. A change in a case's official experimental/control status (unlike spillover) should be readily apparent to the evaluator and will generate systematic changes in the policies applied to the case.

Chapter 5: Data Collection

This chapter addresses three questions concerning data collection in impact evaluations of welfare reform:

  1. What types of baseline data (predemonstration data) are needed on each case, and what are the best methods for collecting these data?
  2. What is the role of follow-up surveys in impact evaluations in which administrative data are available on many key outcomes? What issues should they address? What are appropriate sample sizes and follow-up periods?
  3. What are appropriate standards for quality surveys, in terms of survey administration, response rates, and maintaining sample over time in longitudinal surveys? How should state officials monitor the progress of surveys?

This chapter does not consider the appropriate data collection strategies for the analysis of program implementation or for other evaluation objectives. This focus should be kept in mind in reviewing the recommendations made.

A. BASELINE DATA

Baseline data are data on characteristics and experiences of experimental (or demonstration) and control (or comparison) group members, before the intervention occurs for the experimental or demonstration group, and before the comparable follow-up period begins for the control or comparison group. Such data may be obtained either from administrative records or special data collection efforts. Baseline data are critical to nonexperimental evaluations, since they are needed to control for preexisting differences between the demonstration and comparison groups. In an experimental evaluation, random assignment ensures that the experimental and control groups are the same, on average, in their background characteristics, so controlling for background characteristics is not critical in obtaining unbiased impact estimates. However, baseline data collection merits attention in experimental evaluations for several reasons. Baseline data (1) provide a check on the integrity of random assignment, (2) are used in improving the precision of impact estimates, (3) are used to define subgroups for analysis, and (4) are critical to any nonexperimental analyses (for example, of welfare recidivism, which must be analyzed using nonexperimental methods since not all experimental and control group members leave welfare during the follow-up period). Baseline data may also be important sources of contact information for follow-up surveys. Nonetheless, there have been no explicit standards or requirements for baseline data collection in the federal waiver process.

1. Issues

The rest of this section focuses on baseline data collection in a random-assignment evaluation.(1) Three major issues are involved:

  1. What data should be collected?
  2. When should the data be collected, or (if drawing data from administrative records) what period should the data cover?
  3. What are the relative advantages and disadvantages of collecting baseline data from administrative records, from a special form filled out at intake, or from a survey?
a. Types of Data

The three major types of baseline data are (1) data on background characteristics, (2) identifying information on each individual, and (3) contact information.

Background Characteristics. In an experimental evaluation, detail on background characteristics of sample members is less critical than in a nonexperimental evaluation. However, two kinds of data (which may overlap) generally are very useful: (1) characteristics that define subgroups of interest in the analysis (usually including basic demographic and socioeconomic characteristics), and (2) past histories of the outcomes of interest. In a random-assignment evaluation, multivariate regression models of the outcomes are used largely to reduce the variance of the impact estimates. Variables measuring past histories of the outcomes are generally the most important control variables in such multivariate models because they lead to the largest reductions in the variance of the impact estimate. Data on both case characteristics and past history of the outcome (ideally, over several years) are valuable in assessing the importance of threats to the experimental design, such as crossovers, since they permit assessment of whether the cases that are contaminated or lost from the sample are different from those that remain. Such data may provide enough information to adjust for any experimental-control differences. Other types of background information that are of less central importance include variables that can be used to identify statistical models predicting the effects of program components (see Chapter VI for further discussion). Examples could include data on attitudes toward and knowledge of the welfare system or access to services.
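As an illustration of such a regression adjustment (a sketch only; the data set and column names are hypothetical and are not drawn from the evaluations reviewed), the impact estimate and its standard error could be obtained as follows:

    import statsmodels.formula.api as smf

    def adjusted_impact(df):
        # `df` is assumed to contain an experimental-group indicator and baseline
        # measures of the outcome; the column names are hypothetical.
        model = smf.ols(
            "earnings_year1 ~ experimental + earnings_prior_year"
            " + afdc_months_prior_2yrs",
            data=df,
        ).fit()
        return model.params["experimental"], model.bse["experimental"]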

Identifying Information. It is critical to collect enough identifying information at the time of random assignment to ensure that each individual in a case can be tracked in the full range of data systems to be used in the evaluation. In particular, each person's social security number must be carefully entered and verified. This is especially challenging for individuals who are part of applicant cases that are denied benefits or decide to withdraw their application, since eligibility workers have less incentive to obtain full and accurate information on such individuals.

Contact Information. If a follow-up survey is planned, information should be collected at the time of random assignment that will make it easier to contact the case head (the primary adult in the case) for an interview at a later date. Such information includes phone numbers and mailing addresses, as well as names, addresses, and phone numbers for several friends or family members who typically know where the sample member can be reached.

 
b. Timing of Baseline Data Collection

The issue concerning the timing of baseline data collection is whether it is necessary that data pertain to a period strictly before random assignment or whether the data may cover a period that goes slightly beyond the date of random assignment. If administrative data are being used for baseline characteristics, one concern is that data on new applicants will generally reflect a time slightly after random assignment (for example, if data are extracted only at the end of the quarter), and may thus be affected by the program. If a survey or form filled out at application or redetermination is being used, the issue is whether data must be collected at the time of random assignment (generally no more than a few days or a few hours before random assignment, but with retrospective questions), or whether it is acceptable to collect data within a few days, a few weeks, or even a few months after random assignment.

c. Source of Baseline Data

Good baseline data can be collected either from administrative records or from special forms or surveys, if sufficient planning and resources are devoted to the effort.

Administrative Records. Administrative records are the best source for historical data on outcome variables, especially if the state maintains these records in a consistent format over time. Unemployment Insurance (UI) records data on employment and earnings generally are available. States vary in the quality of these data and in archiving procedures, however, so it may be difficult to obtain these data retrospectively.(2) Administrative data from the AFDC program and related programs from before random assignment generally will be available for ongoing cases, and may be traceable for applicants who participated in the past. Administrative data are less attractive sources for basic demographic data, however. This is because for cases with no previous AFDC history, data as of the end of the month or quarter after random assignment (often, but not always, as entered at application) usually are the only data available. Furthermore, the baseline data entered into the administrative system may be of poorer quality for applicants who are not approved for assistance (if the data are there at all). Administrative data usually are the sources for key identifiers such as social security numbers, but such data again are likely to be of higher quality for approved applicant cases than for denied cases (if data on denied cases are tracked at all); thus, the quality may differ between experimental and control group cases. Finally, administrative data generally are poor sources of contact information.

In principle, administrative systems may be modified to address some of these problems. For example, systems can be modified to record information on denied applicants or to keep certain background variables as recorded at the time of random assignment. Still, such data must be entered by staff for whom they are not immediately useful, and who might also be learning new procedures.

Surveys or Special Forms. Surveys or special forms can be attractive for baseline data collection because they allow collection of information that is not usually in automated data systems. In addition, if timed appropriately, they can be used to obtain data on denied applicant cases. Such special data collection efforts are expensive, but so are modifications to large automated systems. In general, the most useful strategy for collection of baseline data is to collect such data in the program office immediately before random assignment (either in a brief interview with an intake worker or a staff member from the evaluation contractor or through a paper form filled out by the sample member) and then to have staff members review the data for completeness and accuracy.

A telephone survey just after random assignment is a less desirable strategy. Even if sample information is sent to the evaluator quickly, there is often a lag of several months between random assignment and the point at which the sample member is located and interviewed. Surveys several months after random assignment run the risk of lower response rates, contamination by the intervention, and different response rates for experimental and control group members; consequently, they do not provide true baseline data. If the evaluator is not yet chosen and the survey is not yet designed when random assignment begins, there will be further lags before data collection can begin.

If the major reason to do a baseline survey is to obtain contact information for a follow-up survey, and a contact information form was not filled out at application/redetermination, a postcard sent to research sample cases may be acceptable. However, this approach also runs the risk of contamination if there is a differential response rate for treatment and control group members. A small incentive payment to sample members for returning the postcard may help avoid such differences.

2. State Approaches

Of the four state evaluations reviewed that have experimental designs, three rely primarily on baseline data from administrative records:

  • In California, UC DATA has built a longitudinal file with up to five years of preimplementation data on all cases in the welfare reform research sample, on the basis of data recorded in the state's Medicaid data system. Variables include participation in Medicaid, AFDC, and other programs related to Medicaid eligibility. They also have assembled over five years of historical UI records data on employment and earnings for individuals in research sample cases. Constructing these longitudinal files required a major investment but has led to a research infrastructure that now is supporting a wide range of research projects. Another database records demographic characteristics of individuals from the time the case enters the research sample (generally two months after application for new applicants), using data extracted from the county-level AFDC data systems.
  • In Colorado, baseline data from administrative records reflected case characteristics in the month of random assignment, except that there was a plan to try to obtain UI records data for a period before random assignment.
  • In the Michigan TSMF evaluation, information was available from the state's AFDC data system, by person, on basic demographics and on AFDC/SFA participation in the 24 months prior to random assignment. By case, information was available on active welfare status, welfare participation before random assignment, number of children, number of adults, presence of earnings, and so forth, by month. However, little information was available on cases denied for both AFDC and SFA; in the end, these cases were dropped from the sample. The evaluator argued that this exclusion is not a concern, because the intervention largely affects whether a family is approved for AFDC versus SFA, but not whether it is denied for both.

Special forms or surveys were not used to collect baseline information in Colorado and Michigan. California supplemented the administrative data with a telephone survey, and Minnesota relied completely on an intake form:

  • In California, the plan was to conduct the first telephone survey within a few months after random assignment, but the start of the survey was substantially delayed, limiting its usefulness as a baseline survey. The delay was caused in part by the time it took to obtain sample frame information from county-level automated data systems. Delays in instrument development were also a factor, as many stakeholders were involved in reviewing and adding to the instrument. In practice, the "baseline" or Wave I survey took place about a year after random assignment began for ongoing cases and has continued to lag random assignment substantially for applicant cases. Because of this, the survey results are being used only as descriptive background information on the survey sample, not to provide independent variables for the impact analysis.
  • Minnesota used a special baseline data collection form, administered to all ongoing and applicant cases just before random assignment. The individual applying for assistance or subject to redetermination would meet with an intake worker to fill out the Background Information Form. The form took about 10 minutes to fill out, and the response rate was 99 percent. For those already or previously on assistance, some data on their public assistance history were entered by intake staff from the automated system. In addition, the client was given a self-administered Private Opinion Survey on issues such as barriers to work, and attitudes toward work and welfare. The response rate for the Private Opinion Survey was 83 percent.

3. Analysis and Recommendations

An ideal evaluation would combine California's pre-program longitudinal data with Minnesota's baseline information form. Not all states have the resources to do this. However, the following steps toward collecting better baseline data should receive priority.

First, we recommend that states conducting random-assignment evaluations collect at least minimal baseline information at intake; DHHS could develop a prototype form to be adapted to each state's needs. The form should be brief and should focus on basic background information, identification information for all family members, and contact information. It should be filled out by the applicant or recipient jointly with a welfare agency staff member just before random assignment and eligibility determination. Use of a baseline form would require recipients to go through random assignment at redetermination (as recommended for other reasons in Chapter IV). If possible, the staff person responsible for these forms should be someone other than the eligibility worker, and obtaining good data on these forms should be designated as a key part of this person's job.

Second, we recommend that states maintain historical data on program participation and benefits in such a way that the data may be linked to create longitudinal files. If feasible, we recommend that states create the longitudinal files. This effort could be linked to the new requirements for lifetime limits on cash assistance, which will require states to move in this direction. When historical administrative data on outcomes are available, states should use these data in their welfare reform evaluations.

B. ROLE OF FOLLOW-UP SURVEYS

Many of the welfare reform changes are likely to have their most immediate impacts on employment, earnings, and public assistance outcomes. Such impacts can be measured through administrative data. Nonetheless, the waiver terms and conditions required states, to the extent feasible, to collect data on outcomes related to family stability and children's welfare. These outcomes are often of considerable policy interest, but cannot readily be measured with administrative data. States have almost always proposed surveys in response to this requirement.

This section considers the appropriate role of follow-up surveys of experimental (or demonstration) and control (or comparison) group members in an impact analysis. Other possible roles for surveys (such as to collect data on program participation as part of a process analysis) are not considered here.

1. Issues

There are four major questions concerning the role of surveys in welfare reform impact evaluations:

  1. Is a follow-up survey needed at all?
  2. If a survey is conducted, what questions should it address?
  3. What standards should be set for sample sizes for a follow-up survey?
  4. What considerations should affect the timing and frequency of follow-up surveys?

This section focuses on designing a survey to meet the goals of the impact analysis; Section C focuses on operational issues related to collecting high-quality data in surveys.

 
a. Is a Survey Necessary?

If administrative data cover all of the major outcomes of interest in the evaluation, conducting a survey to get at additional outcomes may not be worthwhile. If resources permit, surveys may be used to study the major outcomes in more depth or to obtain data on secondary outcomes. Such surveys, however, may be too expensive for small states or small evaluations to pursue. For a survey to be useful to the impact evaluation, enough resources must be available to obtain high response rates (see Section C).

On the other hand, some interventions primarily target outcomes for which there are no readily available administrative data. Examples include "family cap" provisions (under which no additional benefits are awarded when another child is born to someone on assistance), provisions designed to increase school attendance and immunization rates, and changes in the AFDC-UP program that are intended to promote marriage and family stability. Even in these instances, a survey may not be the only option. Other, less readily available sources of administrative data (such as birth records, school records, or Medicaid records) may be more cost-effective or may provide better-quality data than a survey. These alternative data sources have limitations as well, but should be considered carefully. Planning for a survey to be conducted by the evaluator several years after implementation may take the pressure off state agency staff to obtain other administrative data and, thus, have a counterproductive effect. If alternative sources of administrative data are not planned for early on, important opportunities may be lost, since some of these sources (such as school records) require signed consent forms, and such signatures are most effectively obtained at program intake.

b. Scope and Focus of the Survey

One problem with a survey is that it can become a chance to find out "everything we always wanted to know about the welfare population but have not had the chance to ask." Many stakeholders may wish to pursue their own issues through the survey. Once a survey is undertaken, collecting additional information has relatively low cost (up to a point), so the desire to pursue many issues is understandable. For example, most of the interviewer's time in administering a survey is often used in locating the respondent and gaining cooperation, so the cost of adding another 10 minutes to a 30-minute survey may be modest in comparison. Once an interview goes beyond about 45 minutes, however, maintaining respondent cooperation becomes substantially more difficult.

Nonetheless, the more the survey targets key questions of interest, the easier it is to design the survey to get the best results at the lowest cost. For example, if effects of provisions for UP families are of particular interest, it may be useful to oversample these families. If measuring child care costs is a major concern, it may be useful to stratify the sample by the presence of preschool children. If the major goal of the survey is to obtain information on child care and transportation costs, then questions in this area may need to be quite detailed, even if that implies omitting questions on other interesting but less essential topics.
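
As a concrete illustration of targeting the sample to key questions, the short sketch below shows how oversampling a subgroup of special interest (here, UP cases) changes sampling rates and the analysis weights needed later. The frame counts and interview targets are hypothetical, not drawn from any state plan.

    # Hypothetical oversampling of UP cases in a survey subsample.
    frame = {"UP": 2000, "non-UP": 18000}     # cases available on the sampling frame
    target = {"UP": 500, "non-UP": 1000}      # desired completed interviews, by stratum

    for stratum in frame:
        rate = target[stratum] / frame[stratum]   # within-stratum sampling probability
        weight = 1.0 / rate                       # design weight needed in the analysis
        print(f"{stratum}: sample {rate:.1%} of cases; analysis weight {weight:.1f}")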

c. Sample Sizes

Appropriate sample size standards for surveys are also a concern. Because survey data collection is so expensive, there is consensus that sample sizes for surveys need not be as large as in the part of the evaluation based on administrative records. As in selection of the overall sample, however, it is useful to be clear about the precision standard being used and the trade-offs between survey cost and precision.
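
The trade-off between survey cost and precision can be made concrete with a standard minimum detectable effect calculation. The sketch below is a rough illustration for a two-group comparison of a proportion (80 percent power, 5 percent two-sided test); the sample sizes shown are hypothetical.

    # Approximate minimum detectable difference in a proportion for a
    # two-group design with equal group sizes (hypothetical sample sizes).
    from math import sqrt

    def minimum_detectable_difference(n_per_group, p=0.5):
        factor = 1.96 + 0.84     # z-values for alpha = .05 (two-sided) and 80 percent power
        return factor * sqrt(2.0 * p * (1.0 - p) / n_per_group)

    for n in (500, 1000, 2500, 5000):
        print(f"{n} cases per group: detectable difference of about "
              f"{minimum_detectable_difference(n):.3f}")

Under these illustrative assumptions, a survey sample of 1,000 cases per group can detect impacts of roughly 6 percentage points, while an administrative-records sample of 5,000 cases per group can detect impacts closer to 3 percentage points.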

d. Timing and Frequency of Surveys

In determining the timing of surveys, the challenge is to strike a balance between ensuring an adequate follow-up period for the full impacts of welfare reform to be realized and preventing the follow-up period from being so long that locating respondents and accurate recall become major problems. Multiple surveys (as opposed to a single follow-up survey) may be useful if the impact of a program is thought to evolve over time or if the plan is to focus on different issues at different points in time (for example, a first survey to focus on program participation/process issues and a second survey to focus on costs of working). Multiple surveys at relatively short intervals also offer the opportunity to update contact information and thus make it easier to locate sample members at later follow-up points.

2. State Approaches

On the basis of the state evaluations we have reviewed, the surveys planned or in progress in current waiver demonstrations are broad in scope, generally include samples of 1,000 to 2,000 cases, and are scheduled to occur either at regular intervals or once relatively late in the demonstration period. Information collected tends to include background information and two types of outcomes: (1) economic outcomes not obtainable from administrative data, such as hours of work, wages, participation in non-JOBS education and training, and costs of work such as child care and transportation; and (2) noneconomic outcomes, such as family structure, fertility, health status, health behaviors, and food security. The large number of outcomes pursued in some instances has led to lengthy and expensive surveys. The sample sizes appear modest, particularly for assessing impacts on outcomes such as family structure, since such impacts are expected to be small and therefore more difficult to detect.

The surveys planned or conducted in the five states are described here:

  • In Wisconsin, MAXIMUS planned to survey all research sample cases after they leave assistance and obtain a job, to obtain further information about their employment and other outcomes not captured in administrative data. Then, MAXIMUS would resurvey these individuals annually. Surveys of those who leave assistance were to be used to obtain more detailed information on sample members' jobs than is available in the wage records data, such as information on health insurance, job type, hours and wages, job satisfaction, advancement potential, job stability, and (for those not working) barriers to work and reasons for not working. In addition, the survey was intended to collect data on changes in family composition, although collecting data from vital records was also planned. (This survey was delayed because of difficulties in obtaining a sample frame from administrative data, and had not begun as of spring 1996.)
  • In California, the Wave I "baseline" English/Spanish survey included about 2,200 ongoing cases and has covered about 250 approved applicant cases so far; a Wave II (follow-up) survey is ongoing. Another followup is scheduled for 18 months after the start of Wave II, if funding is available. The surveys are lengthy and comprehensive, although some background items included in Wave I were omitted in Wave II. An additional survey is being conducted that oversamples speakers of four less common languages (Cambodian, Vietnamese, Laotian, and Armenian); it includes the same items as in the surveys of the English/Spanish population plus items concerning English studies and refugee and immigration status and experiences.
  • In Colorado, the evaluators planned three waves of followup--at 9, 24, and 36 months after random assignment--with small samples; targets at 9 months were 1,500 cases; at 24 months, 1,275 cases; and at 36 months, 956 cases, equally divided between experimental and control cases. A wide range of outcomes was to be covered in the surveys; some would be covered in only one or two of the follow-up interviews, to reduce the interview length.
  • In the Michigan TSMF evaluation, Abt plans one survey, 48 months after random assignment, of a random subsample of approximately 1,200 cases (600 experimental/600 control), including both ongoing and applicant cases. The focus of the survey is to be on family stability and on employment-related activities (including employment not measured in UI records), as well as on participation in training, education, and community service.
  • In Minnesota, MDRC plans two rounds of follow-up surveys (to be conducted by RTI). The first followup was designed to include only cases in urban counties, 12 months after random assignment; the sample was to consist of 2,250 cases each in groups E1 and C1, 1,350 in group E2, and 150 in group C2. The second followup is to occur 36 months after random assignment and to be of a sample selected from all seven counties. The first followup focused on employment-training participation, understanding of the program, food use, and job characteristics. The second followup is to cover employment-training participation, total family income, family and child well-being, food use, and attitudes on work and welfare.

All of these surveys have a broad focus, and many include collecting more detailed data on outcomes already available to some extent in administrative data.

3. Analysis and Recommendations

In Chapter II, we discussed the usefulness of narrowing or prioritizing the list of outcomes covered in welfare reform evaluations. Many states have proposed follow-up surveys, in large part to respond to the broad array of outcomes they have been required to examine in the terms and conditions for federal waivers. We recommend that surveys be used more judiciously. In particular, we recommend that states consider other sources of administrative data that may be available as alternatives to surveys; examples include vital statistics and school records. Obtaining information from administrative systems outside the welfare agency presents many challenges, including confidentiality; however, such alternatives may provide more reliable data at lower cost. We also recommend that surveys focus on a few selected topics (except in particularly large or important evaluations, where it makes sense to invest the resources for a broader survey). The goals of a survey should be clearly stated and attainable with the resources planned; poorly designed surveys may be costly but yield little reliable information.

DHHS could help to ensure that particular topics are covered in a similar manner in states that are attempting to tackle similar problems; one approach would be to promote joint effort in instrument design. A good example of the federal government playing such a role is the Food Stamp cash-out demonstrations of the early 1990s, in which the Food and Nutrition Service funded development of a common food use instrument for evaluations in three states.

C. ACHIEVING HIGH RESPONSE RATES

In our review of the five state waiver evaluations, we found varying approaches to surveys; these approaches depended in part on the evaluator's experience in surveys of low-income populations. We also noted that some of the surveys have achieved relatively low response rates; DHHS staff report that low response rates have been a concern in welfare reform waiver evaluations in other states as well. Low response rates are a particular concern when a survey is used for the impact evaluation, since the lower the response rate, the greater the risk of bias in impact estimates based on respondents alone. Here, we consider appropriate standards for an acceptable response rate for a follow-up survey to be used in estimating impacts, as well as the survey practices that are particularly conducive to achieving high response rates and maintaining sample and data quality over time.

1. Issues

High response rates are critical in surveys that are part of an impact evaluation in order to minimize the potential for nonresponse to bias the impact estimates. Nonresponse may bias the impact estimate because those who do not respond to the survey may experience different program impacts from those who do respond. If nonresponse is not correlated with experimental/control status, estimated impacts are unbiased for those who do complete the survey, but not necessarily for the overall population. If nonresponse is correlated with experimental/control status (as, for instance, when experimental group members leave assistance earlier, and are then harder to locate because contact information is more out of date), then the impact estimates will be biased even for respondents.
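
The second situation can be illustrated with a short simulation in which response rates depend on both treatment status and outcomes. The behavioral assumption (experimental group members with higher earnings respond less often) and all numbers are hypothetical.

    # Simulated example of nonresponse bias when response is correlated with
    # both treatment status and the outcome.  All values are hypothetical.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 50000
    treat = rng.integers(0, 2, n)
    earnings = rng.normal(4000, 1500, n) + 500 * treat     # true impact = 500

    # Experimental group members with high earnings (those most likely to have
    # left assistance) respond at 60 percent; everyone else responds at 80 percent.
    respond = rng.random(n) < np.where((treat == 1) & (earnings > 5000), 0.6, 0.8)

    full_sample_impact = earnings[treat == 1].mean() - earnings[treat == 0].mean()
    respondent_impact = (earnings[(treat == 1) & respond].mean()
                         - earnings[(treat == 0) & respond].mean())
    print(round(full_sample_impact), round(respondent_impact))  # respondent-only estimate is biased downward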

Thus, there are two major issues:

  1. Should there be a minimum standard for an acceptable response rate in a follow-up survey to be used in developing impact estimates and, if so, what should the standard be, both for initial and (if applicable) later rounds of followup?
  2. What knowledge is there in the evaluation community concerning survey practices that are conducive to achieving and maintaining high response rates, and how can federal and state officials promote use of these practices?

In discussing these issues, we draw heavily on our experiences as working evaluators and on the insights of the expert panel convened for this project.

a. Standards for Response Rates

The Office of Management and Budget sets the standard of a minimum 80 percent response rate in surveys funded directly by the federal government. For surveys of low-income populations, achieving high response rates is a particular challenge. Low-income families tend to be more mobile, often do not have telephones, and may be suspicious of outsiders asking them questions because of concern about losing government benefits. In some areas, large subgroups of the low-income population do not speak English. However, the authors and experts consulted for this report believe that response rates in the 75 to 80 percent range are achievable with low-income populations when quality survey methods are used. Response rates may be raised to around 85 percent with ample resources for tracking and repeated interview attempts, but they rarely exceed this level.

b. Survey Methods that Promote High Response Rates

In the evaluation community, a number of factors are known to be important in obtaining high survey response rates in follow-up surveys of low-income populations:

  • Initial Contact Information. Obtaining contact information from all sample members at the time of demonstration intake, before any of them can "fall through the cracks," is critical in locating respondents later.
  • Updating Contact Information. It is important to update the initial contact information at least every 18 months. If the time between the initial intake and the first followup is longer, it is good practice to contact the respondents just to update contact information.
  • Sophisticated Tracking Methods. Survey research organizations experienced in surveys of low-income populations typically use computerized tracking systems that can check for addresses and phone numbers in multiple databases (such as credit bureau files and motor vehicle registration records). In an evaluation with several rounds of follow-up data collection, good tracking databases can make it worthwhile to attempt to recontact individuals who were nonrespondents in earlier waves, particularly if the response rate on the first wave was less than ideal.(3)
  • Mixed-Mode Interviewing. Mixed-mode surveys, which initially contact respondents by telephone and then in person if telephone attempts are unsuccessful, achieve higher response rates than telephone surveys alone. Field interviewers are more able to contact individuals without phones and also may be able to locate individuals on the basis of information from neighbors. In-person interviews are generally more expensive than phone interviews, however.
  • Respondent Payments. Paying respondents for their time in responding to the survey, even if the payment is small, generally fosters higher response rates and greater cooperation in obtaining quality data. One problem in paying respondents on AFDC, however, has been that such payments generally were counted as income in computing benefits unless a specific federal waiver was obtained.

The lack of any of these factors does not necessarily imply in itself that a survey will have poor response rates, but it may be seen as a risk factor that requires careful monitoring. An additional risk factor exists if an organization that lacks a track record in interviewing low-income populations conducts the survey. Experienced survey organizations have staff from the level of survey director to interviewer who are adept in the techniques of reaching low-income respondents, as well as resources to plan and organize survey operations to achieve high response rates within the time frames needed for timely followup.

2. State Approaches

Among the five waiver evaluations reviewed for this report, two (in Wisconsin and Michigan) have not yet started their follow-up surveys. However, we document how each evaluator planned to conduct the survey; in three of the states, we have information on response rates. We first review how well each evaluation did (or expects to do) in following the survey practices discussed earlier, and then the response rates realized.

a. Survey Mode

The University of Colorado is using a mixed-mode approach for follow-up surveys for the CPREP evaluation. They assumed 40 percent of interviews would be in person. Mixed-mode interviewing also is being used in the Minnesota evaluations and is planned in the Michigan evaluation. In California and Wisconsin, only telephone interviews are being conducted. In the Minnesota surveys, RTI is using computer-assisted interviewing both for telephone and in-person interviews (CATI/CAPI).(4) The California survey is a CATI survey; this was the reason given for not using in-person followup. For the Wisconsin evaluation, MAXIMUS originally had planned to do mail surveys with telephone followup, but later switched to doing all surveys by telephone.

b. Respondent Payments

Respondents are being paid in the California and Colorado evaluations, and payments are planned in Michigan. In Minnesota, for the 12-month followup, payments are only being offered to those who, according to Minnesota's AFDC system, are not on assistance or do not have a phone number. For the 36-month followup, all respondents will be paid. There is no mention of respondent payments in the Wisconsin WNW evaluation plan. Payments are countable income in all of these states.

c. Initial Contact Information

Lack of contact information from the time of random assignment has been a major problem for the California and Colorado surveys. In California, the initial baseline interviewing did not begin until a year after the demonstration began, in part because of delays in obtaining sample and in part because the development of the survey instrument was delayed, as many stakeholders requested additions and revisions to the survey.(5) The sampling delays occurred because the sample is selected about two months after intake from the state Medicaid data system, and it then takes about another month before the counties forward initial data on sampled cases to the UC-Berkeley Survey Research Center. These data must be processed and samples selected before interviewing can begin. By the time attempts were made to contact sample members, the contact information from the county case files was about a year old. Despite the use of various tracking methods (discussed more later), the Survey Research Center was at a considerable disadvantage because of the delay before the initial contact was made and the lack of information on friends and relatives (since there was no distinct research sample intake at which such information could be collected).

The situation in Colorado was similar. The evaluator sent out letters introducing the survey and contact information forms to be mailed in as soon as possible. However, because the ongoing case sample was selected before the evaluation contractor was chosen, and because it took time to transfer sample information to the contractor, about four to six months had elapsed between the selection of the ongoing case sample and the mailing of the letters. Only 18 percent of the contact forms were completed and returned (but low response rates are not unusual in mail surveys).

In the Wisconsin WNW evaluation, the evaluator planned to obtain contact information in the survey for the process analysis, which was scheduled to occur about four months after sample intake. However, that survey was delayed because the state data system was going through a major revision, and sample information on recipient cases was not made available to MAXIMUS until nine months after the intervention began. The process analysis survey thus occurred 10 to 12 months after enrollment.

The 12-month survey for the Minnesota MFIP evaluation, in contrast, has had the advantage of drawing contact information from a form filled out at the time of random assignment, and has achieved high response rates, as discussed further later.

d. Follow-Up Contact Information

In Wisconsin, the plan was to conduct annual surveys after the case left assistance; given the two-year WNW time limit on benefits, the first survey would probably occur within 18 months after the initial contact for the process study. Surveys in California and Colorado are planned to occur at 12- to 18-month intervals. In the Michigan and Minnesota evaluations, at least part of the sample would be surveyed only once, as much as three to four years after random assignment. We do not know of any plans to contact sample members during the intervening period.

e. Tracking Methods

Careful use of a range of tracking procedures and of tracking databases can be critical in locating respondents. In California and Colorado, as discussed above, the evaluators were handicapped from the start by delays between when enrollment in the demonstration began and when the survey began. In both California and Colorado, as soon as the survey organizations had the sample information, they sent out requests by mail for contact information to the address in the public assistance records. In California, UC-Berkeley offered a $5.00 incentive for returning the information. The response rate to this request was 30 percent in California and 18 percent in Colorado.

In California, for the Wave I survey, the UC-Berkeley Survey Research Center asked the county welfare departments to check their records on cases they were not able to locate; three of the four counties were able to comply with these requests. In addition, the Survey Research Center used directory assistance and address corrections from the post office, as well as the state Parent Locator system used for child support enforcement. For the Wave II survey, they did not use the Parent Locator system; by that time, however, they had on-line access to credit bureau and motor vehicle registration databases, as well as contact information obtained in the Wave I interview. They did not attempt to contact nonrespondents to Wave I.

In Colorado, the evaluator is relying largely on checks of state automated systems for the AFDC, Food Stamp, and child support enforcement programs, as well as on-line telephone directories. They sometimes have reviewed hard-copy case files for the names of friends or relatives. Although one progress report discusses obtaining credit bureau data, there is no indication that this was implemented. The original Colorado survey plans indicated that on each successive wave they would contact only those who had completed the previous wave; this would lead to smaller sample sizes for later waves. There is no indication they have reconsidered this plan at this time, although use of better tracking methods could improve response rates on later waves.

The Wisconsin WNW work plan discusses only directory assistance and public assistance record checks. We have no information on tracking methods planned for the Michigan and Minnesota surveys. However, Abt Associates, which will conduct the Michigan survey, and RTI, which is conducting the Minnesota surveys, are experienced firms with access to a wide range of tracking methods. In Minnesota, MDRC staff reported that cases were tracked and interviewed throughout the state, and sometimes even after they had moved to other states.

f. Response Rate Goals and Actual Experience

In the surveys in California and Colorado, response rates have been low enough to put the usefulness of these surveys for the impact analysis in considerable doubt. For the first Colorado follow-up survey, after nearly four months of survey operations, the response rate for the ongoing sample was 41 percent, much lower than the 60 percent goal; the rate was the same for experimentals and controls. Locating respondents was the key problem. In California, the Wave I English/Spanish survey began a year after random assignment and took 10 months to complete for ongoing cases; the response rate was just under 60 percent, but oversampling allowed UC-Berkeley to reach the desired number of completes.(6) Locating respondents again was the major problem. The response rate for Wave II, which began 18 months after Wave I, has been over 80 percent of those reached in Wave I, or about 50 percent of the original sample. No attempts were made to contact, for Wave II, cases who were not interviewed as part of Wave I.

In the Michigan demonstration, the target response rate for the survey is 80 percent. We have no information on actual experience, but the long follow-up period of four years makes this seem an ambitious goal. In Minnesota, the target for the 12-month followup was 85 percent; for the 36-month followup the target is 80 percent. The first follow-up survey in Minnesota achieved a response rate of 84 percent, very close to the goal; MDRC staff members report that they and RTI decided to end the survey slightly below the target to preserve resources for the second follow-up survey. No target response rate was set for the Wisconsin WNW demonstration. The follow-up survey in Wisconsin has not yet occurred.

3. Analysis and Recommendations

We recommend specifying a response rate standard for surveys that are to be used for impact analysis. An appropriate standard would be from 70 to 80 percent (with the lower end of the range for later rounds of followup); such response rates are achievable with the types of practices discussed in Section C.1, including contact information collected at intake and updated regularly. Welfare agencies could also encourage achievement of high response rates by exempting respondent payments from countable income. In addition, it should be standard practice to compare the characteristics of respondents and nonrespondents in the available administrative data, to assess the likely magnitude of response bias.
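
A minimal sketch of the recommended respondent/nonrespondent comparison follows. It assumes a hypothetical administrative extract with a survey-disposition flag; in practice, the baseline variables and the data themselves would come from the state's own records rather than the synthetic values used here.

    # Sketch of a respondent/nonrespondent comparison on baseline
    # administrative data.  The data here are synthetic stand-ins for a
    # state administrative extract merged with the survey disposition file.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    admin = pd.DataFrame({
        "responded": rng.integers(0, 2, 1000).astype(bool),
        "prior_afdc_months": rng.integers(0, 61, 1000),
        "prior_earnings": rng.gamma(2.0, 1500.0, 1000),
        "num_children": rng.integers(1, 5, 1000),
    })

    # Large differences in baseline means between respondents and
    # nonrespondents signal a risk of nonresponse bias in survey-based
    # impact estimates.
    baseline_vars = ["prior_afdc_months", "prior_earnings", "num_children"]
    print(admin.groupby("responded")[baseline_vars].mean())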

We also recommend that the sponsors of welfare reform evaluations monitor data collection plans carefully to ensure that survey practices needed to achieve high response rates are being used. In particular, we recommend not approving any survey plan that lacks two or more of the "best practices" described in Section C.1. If the state is not able to invest the level of resources implied by these practices, then the survey may not produce data of sufficient quality for an impact analysis. Finally, we recommend close monitoring of surveys that lack any one of these practices or that are conducted by a survey organization that is relatively inexperienced with low-income populations. If such surveys produce low response rates early on, then states should carefully consider whether to add resources to survey operations (for example, by adding field followup to a telephone survey) or whether to discontinue the survey altogether.

Notes

(1) We focus on experimental evaluations because baseline data collection in these evaluations has typically received less attention. Data collection needs are similar in nonexperimental evaluations with comparison site designs. In nonexperimental designs in which a pre-program sample (or pre-program data on the same sample) serves as a comparison group, data needs typically are greater--comparable to those for the demonstration follow-up period.

(2) UI records data are data collected from employers and maintained by state UI agencies for use in determining whether an individual qualifies for UI benefits. These data include information on all jobs individuals hold in a quarter and total earnings in each quarter for each job. These data generally are available to other state agencies for legitimate research (with appropriate confidentiality restrictions).

(3) For example, in the evaluation of the Minority Female Single Parent Demonstration, there were 12-month and 30-month follow-up interviews. By recontacting all sample members for the 30-month interview, regardless of whether they had completed the 12-month interview, response rates for the 30-month interview were increased from 73 to 80 percent (Rangarajan et al. 1992).

(4) In computer-assisted telephone interviewing (CATI), the survey instrument appears before the interviewer on a computer screen, and responses are typed directly into a computer. Skips of particular questions based on particular responses may be programmed into the computer, reducing the scope for interviewer error, and the data entered by interviewers are converted directly into a research database. The availability of portable personal computers has led to recent growth of computer-assisted personal interviewing (CAPI), but this technology is less widely available.

(5) As noted earlier, the impact evaluation in California will rely mostly on baseline data from administrative records.

(6) The Wave I foreign language survey had a higher response rate than the English/Spanish survey (around 70 percent).

Chapter VI: Analysis Methods

A major goal of a welfare reform evaluation is to analyze the reform's impacts. This chapter addresses four analytic issues that such an analysis must take into account:

  1. What efforts, if any, should be made to distinguish impacts of specific welfare policy changes?
  2. What efforts, if any, should be made to estimate entry effects arising from welfare reform?
  3. How should crossover cases be treated for analysis purposes?
  4. Should impacts be estimated for subgroups that are defined by events that occur after random assignment? If so, how?

A. DISTINGUISHING IMPACTS OF SPECIFIC POLICY CHANGES

State welfare reform demonstrations usually include changes in several policies. For example, in recent years, many states have combined work requirements for welfare recipients with measures (such as more generous earned income deductions and higher asset limits) to make work pay. Policymakers and evaluators may want to distinguish impacts of specific welfare policy changes. Such analyses could identify which components of a welfare reform package are most effective in achieving particular goals.

The welfare reform waiver system has recognized the importance of distinguishing impacts of specific policy changes. When multiple welfare reform policies have been implemented under a single welfare reform waiver demonstration, the terms and conditions of Section 1115 welfare waivers have required the evaluator to "discuss the feasibility of evaluating the impact of individual provisions" of the total package. When multiple welfare reform policies have been implemented under separate welfare reform waiver demonstrations, the terms and conditions of Section 1115 welfare waivers have stated that "possible confounding effects from other demonstrations ...must be addressed in detail."

1. Issues

This section focuses on distinguishing impacts from particular welfare policy changes (as opposed to the whole package of changes), either in a random-assignment evaluation or in a nonexperimental one. The two main issues addressed are:

  1. How can impacts of particular policy changes (as opposed to the whole package) be measured using experimental methods?
  2. How can impacts of particular policy changes be measured using nonexperimental methods?
a. Distinguishing Impacts for Policy Changes Using Experimental Methods

The most rigorous way to distinguish impacts for specific policy changes is to employ an evaluation design with random assignment to multiple experimental groups. If there are several experimental groups, each exposed to different sets of policies, and a control group exposed to pre-reform policies, then the impacts of each set of policies can be measured and compared. Without such a design, the direction and relative size of impacts from two sets of policy changes can sometimes be inferred from the impact of both sets together.

Unlike an evaluation design with random assignment to a single experimental group, an evaluation design with random assignment to multiple experimental groups allows impacts to be estimated for multiple sets of policy changes, even if these changes were implemented simultaneously. For example, a welfare reform package may include both expanded earned income deductions and work requirements. Each set of provisions is likely to have positive impacts on employment rates. If an evaluation design included only one experimental group (X1), subject to both provisions, and a control group (N1), subject to neither provision, it would be impossible to distinguish separate impacts on employment. In contrast, if an evaluation design also included two partial experimental groups, one (X2) subject to the earnings incentives but not the work requirements, and the other (X3) subject to the work requirements but not the earnings incentives, the following impacts could be measured:

  • (X1 - N1) = combined impact of earnings incentives and work requirements
  • (X2 - N1) = impact of earnings incentives when work requirements are absent
  • (X1 - X3) = impact of earnings incentives when work requirements are present
  • (X3 - N1) = impact of work requirements when earnings incentives are absent
  • (X1 - X2) = impact of work requirements when earnings incentives are present
  • (X2 - X3) = impact of earnings incentives alone versus work requirements alone

Moving from a two-group experimental design to a four-group experimental design multiplies by a factor of six the number of impact estimates that can be obtained from the evaluation. If a three-group experimental design were used (for example, groups N1, X1 and X2), then three impact estimates could be obtained from the evaluation: (X1 - N1), (X2 - N1), and (X1 - X2). However, it is helpful to be able to estimate X2 and X3 separately, since, because of interaction effects of earnings incentives and work requirements, it will not necessarily be the case that (X1 - N1) = (X2 - N1) + (X3 - N1).
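
The arithmetic of these contrasts can be made concrete with hypothetical employment rates for the four groups; the rates below are illustrative only, not estimates from any demonstration.

    # Hypothetical employment rates for the four groups described above.
    rates = {"N1": 0.40,   # control: neither provision
             "X2": 0.46,   # earnings incentives only
             "X3": 0.47,   # work requirements only
             "X1": 0.55}   # both provisions

    contrasts = {
        "combined package (X1 - N1)":                rates["X1"] - rates["N1"],
        "incentives without requirements (X2 - N1)": rates["X2"] - rates["N1"],
        "incentives with requirements (X1 - X3)":    rates["X1"] - rates["X3"],
        "requirements without incentives (X3 - N1)": rates["X3"] - rates["N1"],
        "requirements with incentives (X1 - X2)":    rates["X1"] - rates["X2"],
        "incentives versus requirements (X2 - X3)":  rates["X2"] - rates["X3"],
    }
    for label, value in contrasts.items():
        print(f"{label}: {value:+.2f}")

    # With an interaction effect, the combined impact (0.15) differs from the
    # sum of the two single-provision impacts (0.06 + 0.07 = 0.13).

In this hypothetical example, the combined impact exceeds the sum of the two single-provision impacts; it is precisely this interaction that cannot be detected without both partial experimental groups.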

When an experimental evaluation design includes only a single experimental group subject to all welfare reform policies, and a control group subject to none, distinguishing the impacts of separate policy changes is more difficult. Sometimes, even when two sets of policies are implemented simultaneously, there are theoretical grounds for attributing opposite signs (directions) to the impacts of each set. In this case, the sign of the impact estimate indicates which set of policies has the larger impact. For example, expanded earnings incentives are likely to increase welfare participation levels by broadening eligibility, while stricter work requirements are likely to decrease welfare participation levels by reducing leisure or by imposing financial sanctions for noncompliance. If the impact of the combined changes on welfare participation is positive, then the positive impact of expanded earnings incentives must be larger in absolute value than the negative impact of stricter work requirements.

In contrast, whenever the anticipated impacts from multiple policy changes are in the same direction, it is impossible to distinguish the impacts of specific changes with only one experimental group. For example, since expanded earnings incentives and stricter work requirements both are likely to lead to higher employment rates, the evaluator cannot assess the contribution of each policy change to the package's overall impact on employment, even with knowledge of the combined impact of these two provisions.

When welfare reform policies are implemented sequentially rather than simultaneously (usually for programmatic rather than evaluation reasons), additional opportunities may be introduced to infer impacts for separate policies in a two-group experimental design. For example, if expanded earnings incentives are implemented immediately, but stricter work requirements are added after 24 months, the first two years of estimated impacts can be attributed to the expanded earnings incentives alone.

Although the staggered implementation of welfare reform policies provides opportunities for inferring impacts of particular changes, care must be taken in determining the groups compared after the implementation of a second set of reforms. Welfare cases assigned to experimental or control groups before the second stage of implementation are likely to be affected by their initial exposure to only the first set of welfare reform policies. Evaluators should distinguish impacts on cases with staggered exposure to the two welfare reform packages from impacts on cases exposed to the packages in combination only.

b. Distinguishing Impacts Using Nonexperimental Methods

Regardless of whether the underlying evaluation design includes random assignment, evaluators can use several nonexperimental methods to attempt to assess the impacts of different welfare reform policies. First, a process study can often identify the components of a program likely to have been most (or least) effective through interviews with program staff and with clients; for instance, such interviews can identify components that were never implemented or that were misunderstood, versus components that were implemented well. Second, impacts of particular provisions of a welfare reform package may be analyzed by comparing outcomes for cases that participate in those components of the overall package (such as a JOBS program) to outcomes for cases that do not participate. Third, the staggered implementation of welfare reform policies in particular sites can help to distinguish impacts of different measures. Fourth, the evaluation design may call for certain research sites to implement only a subset of the total welfare reform package; this allows separate impacts to be estimated in a way similar to the use of a partial experimental group in an experimental design.

When the evaluation design does not incorporate random assignment, nonexperimental methods must be used. Unfortunately, these approaches are less likely than experimental methods to lead to reliable estimates of the impacts of different welfare reform provisions. The main disadvantage of nonexperimental approaches is that the groups being compared most likely differ not only by being subject to different policies, but also in other ways. For example, cases that decide to participate in a program are probably systematically different from cases that decide not to participate.(1) Similarly, when welfare reform policies are implemented in stages, applicants subject to both the first and second stage of reforms probably differ in systematic ways from cases initially subject only to the first stage of reforms. Finally, when different research sites implement different combinations of welfare reform policies, systematic differences probably exist between the sites that are confounded with the impacts of the policy combinations. Although statistical procedures such as multivariate regression can adjust for observed differences between different groups of cases, only random assignment can ensure that the unobserved characteristics of different groups of cases are, on average, the same.

Nonexperimental analysis of component impacts may be most useful as a supplement to experimental estimates of the impacts of the entire reform package. In this situation, the experimental design can be relied upon to determine whether a welfare reform package is associated with statistically significant differences in outcomes. Nonexperimental analyses (particularly process analyses) can help to establish which policy changes appear to be most responsible for the observed impacts of the entire package.

2. State Approaches

Of the five waiver evaluations reviewed, only one--Minnesota's MFIP evaluation-- included multiple experimental groups. In the three urban counties participating in the MFIP demonstration, the research sample included two experimental and up to two control groups. The full MFIP experimental group (E1) received both the financial incentives and the case management provisions of the welfare reform package. The partial MFIP experimental group (E2) received the financial incentives portion of the welfare reform package but continued to receive JOBS (job-training) services under the pre-welfare reform rules. The AFDC + JOBS control group (C1) was subject to the full set of control policies, while the AFDC-only control group (C2) was not eligible for JOBS services. By comparing differences between these groups, it is possible to distinguish the impact of the full welfare reform package (E1 - C1) from the impact of the case management portion of the welfare reform package (E1 - E2), the impact of the financial incentives portion of the welfare reform package (E2 - C1), and the impact of current JOBS services (C1 - C2).

In two other states--California and Michigan--a two-group random-assignment design was originally adopted to study impacts from an initial set of welfare reform waivers and was subsequently used to study impacts from a combination of two waiver packages. In California, the APDP was implemented in December 1992 and the WPDP in March 1994. A two-group random-assignment design was adopted, with experimental cases subject to whatever reform policies had been implemented and control cases to neither set of welfare reform policies. Random assignment of applicant cases was scheduled to continue through December 1996. Presumably, cases that went through random assignment before March 1994 could be studied for up to 15 months to infer impacts from the APDP, while cases that went through random assignment after March 1994 could be studied to infer impacts from the combination of the APDP and WPDP. For cases that went through random assignment before March 1994, impacts measured after March 1994 would need to be attributed to the APDP plus the WPDP implemented some time later.

In Michigan, the first set of TSMF provisions was implemented in October 1992, and an additional set of provisions approved under a second waiver was implemented in October 1994. As in California, the evaluation sample consists of a single experimental group and a control group, with the experimental group subject to all welfare reform policies implemented to date. Random assignment of applicants was scheduled to continue until October 1996. The evaluator is planning to distinguish impacts for cases that applied for assistance before October 1994 from impacts for those that applied after October 1994. The evaluator has no plans to compare the impacts of the first package with the impacts of the combination of the two waiver packages, because the characteristics of applicants were different between 1992 and 1994. In addition, there are no systematic plans for distinguishing the impacts of separate waiver provisions within each major reform package, although the timing of particular provisions might allow some inferences to be made. For instance, for recipient cases, work requirements were not implemented until April 1993, but the first set of impacts that the evaluator reported for this group was measured as of October 1993, after work requirements had already been implemented.

Colorado's CPREP program includes a variety of welfare reform provisions in a single package; CPREP is being evaluated using a two-group experimental design. Currently, no efforts are under way to estimate separate impacts of the different provisions of this package.

In Wisconsin, the evaluator proposed distinguishing impacts of individual components by comparing outcomes for cases that participate in those components with outcomes for cases that do not. As noted earlier, impacts of individual components estimated through these nonexperimental methods are likely to be less reliable than impacts estimated through experimental methods, because there is no guarantee that the underlying comparisons are between equivalent groups of cases.

All five of the evaluations we reviewed include process studies based in part on interviews with program staff and clients on their experiences with welfare reform. These interviews and related analyses will not enable evaluators to attach numerical values to impacts from separate provisions of welfare reform packages; however, they may help to identify particular provisions of each package that are strongly associated with particular outcomes from welfare reform.

3. Analysis and Recommendations

We recommend that states that want to estimate impacts for separate components of a welfare reform package consider evaluation designs with multiple experimental groups. Such designs (Minnesota's four-group MFIP design is an example) can be more informative to policymakers than the standard two-group experimental design. The major disadvantages of multigroup designs are that they require a larger research sample to achieve the same precision standards as two-group designs and that a state must administer three or more programs simultaneously in the research sites. To reduce the burdens of maintaining a four-group design, states may want to consider adopting a three-group experimental design, defining two experimental groups--a full experimental group subject to all of the welfare reform provisions and a partial experimental group subject to a subset of the welfare reform provisions--in addition to a control group. The policy changes from which the partial experimental group would be exempt would depend on the interests of the state but might include components of the proposed welfare reform package that are especially controversial or untested.

When states introduce a new welfare reform package after an evaluation of an earlier initiative has begun, we recommend that a second research sample be created, if possible; this would preserve the integrity of the research sample used to study the first initiative. The second sample would consist of recipient and applicant cases that are randomly assigned to either the earlier package only or to the combination of policies contained under both packages. Creation of a second research sample would require state officials to administer welfare under three different regimes, but it would make it much easier to distinguish impacts of the first and second set of welfare reform packages, for both recipients and applicants, in both the short and the long term.

If more than a two-group experimental design is not possible, we recommend that evaluators not attempt to estimate impacts for specific welfare reform provisions within the overall package. Our investigation of welfare reform waiver evaluations found no evidence that separate impacts for different welfare reform provisions can be distinguished reliably in the absence of a design with multiple experimental groups. Instead, we recommend that evaluators confine their analysis of separate welfare reform provisions to qualitative inferences obtained on theoretical grounds or through a process study that includes interviews with program staff and welfare recipients.

B. ESTIMATING ENTRY EFFECTS

Even in an experimental evaluation in which random assignment is implemented properly, the validity of impact estimates might be questioned if the adoption of welfare reform has induced substantial entry effects. Entry effects arise when the adoption of a welfare reform package either encourages or discourages applications for welfare, thereby changing the composition of the population of welfare applicants. For example, a welfare reform initiative that expands job-training programs might encourage applications for welfare, while a welfare reform initiative with stringent time limits might discourage applications. Entry effects do not bias impact estimates for the population that applies for assistance following the implementation of welfare reform. Nonetheless, when entry effects are present, impact estimates may not be valid for the population that would have applied for assistance in the absence of welfare reform. The terms and conditions of current Section 1115 welfare waivers state, "The evaluation contractor will explain how entry effects can be determined and will describe the methodology which will be employed to determine the entry effects of" the welfare reform program.

1. Issues

This section considers two issues related to the estimation of entry effects:

  1. Should efforts be made to estimate entry effects arising from welfare reform? If so, what data are needed and how should they be analyzed?
  2. Does the likelihood of entry effects call into question the validity of impact estimates from a random-assignment evaluation?
a. Feasibility of Estimating Entry Effects

One way to infer the direction of possible entry effects is to examine the impact of welfare reform on the exit behavior of recipient cases. As Moffitt (1993) noted, when welfare reform changes the benefits and potential earnings of welfare recipients and applicants, exit effects (effects on the probability of exiting welfare for recipient cases) are likely to be opposite in sign to entry effects:

The conventional, "static" theory suggests that potential applicants as well as recipients continually compare two variables in making decisions to apply or exit: potential earnings in the private labor market, and the welfare benefit. Empirical research has strongly confirmed this theory, for welfare benefits and potential earnings have been shown repeatedly to have strong positive and negative effects, respectively, on the probability of being on AFDC at a point in time and on the probability of entering the rolls; and the probability of exiting the rolls has been shown to be negatively affected by benefits and positively affected by potential earnings.

For example, the imposition of time limits would tend to lower the expected value of welfare benefits, leading both to higher rates of exits by welfare recipients and to lower rates of application for welfare.(2)

Detecting exit effects in a particular direction may help to infer the direction of entry effects, but obtaining estimates of the size of entry effects requires the analysis of time-series data on applications to a state's welfare program. For example, if the data and analytic resources were available, monthly levels of applications could be studied over a multiyear period, adjusting for time-varying factors such as local unemployment rates, population changes, and the implementation of new policies (such as expansions of eligibility for welfare). To calculate application rates, applications could be compared to the population of potential welfare applicants, which could be estimated (at least in large states) from household survey data. Entry effects would be measured as the extent to which adjusted rates of application differ following the implementation of a welfare reform package. Exit effects could also be measured using aggregate time-series data on the size of the caseload of ongoing welfare recipients. Unfortunately, as with most estimates obtained from nonexperimental analyses, these estimates of entry and exit effects would probably be somewhat sensitive to the control variables included and the statistical assumptions underlying an evaluator's model of applicant behavior.
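
A minimal sketch of such a time-series analysis, under assumed data, is shown below. It regresses a simulated monthly application rate on a local unemployment rate and an indicator for the post-implementation period; the variable names, the March 1994 implementation date used here, and the simulated series are illustrative assumptions rather than features of any state's actual analysis.

# Illustrative sketch: simulated monthly application rates, not state data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
months = pd.period_range("1990-01", periods=72, freq="M")
unemployment = 6 + 2 * np.sin(np.arange(72) / 12) + rng.normal(0, 0.3, 72)
post_reform = (months >= pd.Period("1994-03", freq="M")).astype(int)  # assumed date

# Simulated applications per 1,000 at-risk households: rise with unemployment,
# fall after the (assumed) welfare reform implementation date.
applications = 20 + 1.5 * unemployment - 2.0 * post_reform + rng.normal(0, 1, 72)

X = sm.add_constant(pd.DataFrame({"unemployment": unemployment,
                                  "post_reform": post_reform}))
model = sm.OLS(applications, X).fit()

# The coefficient on post_reform is the estimated entry effect on the
# application rate, holding the included control constant.
print(model.params["post_reform"], model.bse["post_reform"])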

Moffitt (1992) proposed an experimental approach for measuring entry effects. This approach would involve randomly assigning the welfare reform and control policies to a large number of different sites, then comparing entry and exit rates in the sites. Moffitt notes that such an approach has many practical problems, including the difficulty of obtaining enough sites, the problem of cross-site migration, and the challenge of maintaining stable policies in each site for more than a very limited period of time. He concludes that nonexperimental approaches using administrative data are a more feasible way to obtain estimates of entry effects.

b. Implications of Entry Effects for Results of a Random-Assignment Evaluation

If entry effects are detected, the validity of a random-assignment evaluation for the sample of applicants may be questioned. When entry effects are present, the number and characteristics of applicants are different than they would have been in the absence of welfare reform. As a result, impact estimates from the sample of applicants do not necessarily apply to the cases that would have applied for assistance had welfare reform not been implemented in the research sites. Nonetheless, as noted previously, impact estimates would remain unbiased for the population of actual applicants. If entry effects are small, impact estimates for applicants may still provide a good indication of the effects of the experimental policies on cases that would have applied for assistance in the absence of welfare reform.

Entry effects do not call into question a random-assignment evaluation's results for recipient cases. This is because, when welfare reform is implemented, recipient cases are already on welfare.

2. State Approaches

The five evaluations studied differed in how much they examined entry effects. Two evaluations have devoted substantial attention to this issue. In Wisconsin's WNW evaluation, entry effects are being estimated using aggregate and disaggregate time-series modeling of application behavior before and after the implementation of welfare reform. Early evidence from process analyses suggests that entry effects may be responsible for a large portion of the caseload changes arising from the WNW package; these findings will need to be confirmed through the time-series analyses described above.

In the APDP/WPDP evaluation, entry effects are being estimated using both administrative data and data from the Current Population Survey (CPS).(3) A time-series model of the fraction of "at-risk" women starting a welfare spell is being estimated using data from the early 1970s to the early 1990s, combining CPS data on the number of women at risk of becoming welfare dependent with monthly caseload data on the number of new welfare spells. To investigate exit effects, another time-series model of terminations from AFDC is being estimated, with separate analyses for the AFDC-UP caseload. Approximately 240 observations are being analyzed. The models control for benefit levels, birth rates, real wages, minimum-wage changes, unemployment rates, and key milestones in welfare policy (such as the OBRA changes of the early 1980s, which substantially reduced earnings disregards). In general, policy changes such as those adopted under OBRA were associated with large entry effects. Using this model, caseloads for the period following the implementation of welfare reform are being forecasted and compared with actual caseloads.

In the other states, efforts to determine entry effects were more modest or nonexistent. In Michigan's TSMF evaluation, the evaluator proposed asking questions in the client survey about possible entry effects, but no time-series analyses of applications or terminations were planned. In Minnesota's MFIP evaluation, the importance of entry effects was recognized, but no attempt was made to estimate them: "It was decided that time-series analyses will yield little reliable data since none of the [demonstration] counties are saturating their caseload with MFIP."(4) In Colorado's evaluation, no efforts are being made to estimate entry effects.

None of the evaluations we reviewed are using analysis of exit rates of recipient cases to infer the direction of possible entry effects arising from welfare reform provisions.

3. Analysis and Recommendations

The presence of large entry effects induced by welfare reform can call into question the validity of impact estimates for applicant cases, even if the evaluation features an experimental design. Only two of the five evaluations we reviewed included substantial efforts to study entry effects by analyzing application and termination behavior over time. If longitudinal data on applications are not available, time-series analyses are not feasible. If data are available, there may not be sufficient resources for the analysis in an evaluation largely focused on experimental impact estimates. Even if adequate longitudinal data and analytic resources are available for a particular state, the results of the estimation of entry effects may be sensitive to statistical assumptions employed by evaluators.

Remarkably little research exists on entry effects. Therefore, we recommend additional research on entry effects, which may be separate from random-assignment evaluations of state welfare reform initiatives, since the data collection and analytic needs for each type of study differ. Evaluations of entry effects could look at monthly welfare applications and terminations across several states, using standardized statistical methodologies and data sources such as historical caseload records from the states, the federal Integrated Quality Control System, or the Survey of Income and Program Participation. A major goal of studies using data from several states should be to identify the sorts of policy changes that are most likely to be associated with large entry effects over time. Another goal of such research should be to identify ways to combine nonexperimental entry effect estimates with experimental impact estimates to assess the overall consequences of a state's welfare reform program for applicant cases.

C. TREATMENT OF CROSSOVER CASES

Even under the best circumstances, a fraction of research cases in an experimental welfare evaluation probably will have their original experimental/control status contaminated (as discussed in Chapter IV). Such contamination can arise for several reasons:

  • Research cases relocate to nonresearch sites and receive different policies.
  • Research cases merge with other cases and receive different policies.
  • Cases split off from research cases and receive different policies.
  • Cases have their experimental/control status altered through administrative error or manipulation.

These cases are commonly called crossover cases, since, in each instance, cases "cross over" from one set of policies to another. Some degree of crossover almost always occurs in a random-assignment evaluation.

Some terms may be helpful in discussing the implications of crossover. Migrant crossover cases are cases that experience a change in experimental/control status as a result of migration to a nonresearch site; merge/split crossover cases are cases that experience a change in experimental/control status as a result of a case merger or split; and administrative crossover cases are cases that experience a change in experimental/control status as a result of administrative error or manipulation. Crossover-type cases are research cases that would be inclined to migrate, merge, split, or otherwise change experimental/control status under at least one of the two sets of policies (experimental and control). Crossover-type cases include actual crossover cases and cases that would have crossed over had they been assigned to the other experimental/control group.

1. Issues

This section considers two analytic issues related to crossover cases:

  1. Should crossover cases be included or excluded from the analysis sample?
  2. Depending on how crossover cases are treated for analysis, should statistical corrections be employed to adjust for a possible bias in impact estimates?
a. Implications of Including or Excluding Crossover Cases from the Analysis Sample

Unless crossover cases leave the state, administrative records on these cases usually will be available even after they migrate to a nonresearch site, split from another case, or merge with another case. Specifically, as long as welfare participation and earnings information are stored in statewide data systems, it will be possible to determine if a research case is participating in welfare or has (UI-covered) earnings. Consequently, researchers will be able to include most crossover cases in the sample used to generate impact estimates.

If crossover cases are included in the analysis sample, the difference in mean outcomes between original experimental cases and original control cases will tend to understate the impact of welfare reform. This dilution results because some cases will have received a mixture of experimental and control group policies. The extent of bias in the impact estimates will depend on the extent of crossover, as well as on the manner in which the impact of welfare reform on crossover-type cases differs from the impact of welfare reform on noncrossover-type cases.

If all crossover-type cases could be identified, then these cases could be excluded from the analysis sample, and impacts could be estimated for noncrossover-type cases only. While these impact estimates would not be representative of impacts of welfare reform on crossover-type cases, they would be unbiased estimates of the impacts of welfare reform on noncrossover-type cases.

In practice, however, crossover-type cases cannot be identified perfectly, since it is uncertain which of the cases in each experimental/control group would cross over if they were subject to the other set of policies. A common practice is to exclude from the analysis sample cases that migrate to nonresearch sites, regardless of their experimental/control status. As long as migration does not depend on experimental/control status, this exclusion will eliminate from the sample both migrant crossover cases and the corresponding group of crossover-type cases. Similarly, to correct for merge/split crossover, all cases that merge or split could be deleted from the sample (although, in practice, it is often difficult to identify all merging or splitting cases within state administrative files).

If crossover behavior does depend on a case's experimental/control status, then excluding crossover cases from the analysis sample will lead to biased impact estimates. The size and direction of the resulting bias will depend on the incidence of crossover and the relationship between experimental/control status and crossover behavior. Biased impact estimates also will result from excluding crossover cases if crossover behavior is correlated with unobserved factors (such as motivation) that affect outcomes. It is not clear whether the bias in impact estimates from excluding crossover cases from the sample exceeds the bias from including these cases in the sample.

b. Statistical Corrections for Crossover

We consider corrections in two situations: (1) when crossover cases are included in the analysis sample, and (2) when crossover cases are excluded from the analysis sample.

Corrections When Crossover Cases Are Included. When crossover cases are included in the analysis sample, impact estimates (obtained as the difference of means between original experimental cases and original control cases) will tend to be diluted, because some original experimental cases will have been exposed to control policies and some original control cases to welfare reform policies. A proposed correction for this dilution is the Bloom correction (Bloom 1984; Bloom et al. 1993). In its simplest form, this procedure involves dividing the uncorrected impact estimate (the difference in mean outcomes for the experimental and control groups) by one minus the sum of the crossover rates for experimental and control cases. For example, if the crossover rate is 0.05 for experimental cases and 0.15 for control cases, the Bloom correction would involve dividing impact estimates by 0.80. The crossover rate may be measured in at least four ways for experimental and control cases:

  1. As the fraction of experimental cases currently subject to control group policies and as the fraction of control cases currently subject to welfare reform policies. This measure is most appropriate when prior exposure to the other set of policies has little or no effect on current outcomes.
  2. As zero for experimental cases and as the fraction of control cases ever subject to welfare reform policies. This measure is most appropriate when any exposure to welfare reform policies is equivalent to continual exposure to welfare reform policies.
  3. As the fraction of experimental cases ever subject to control group policies and as zero for control cases. This measure is most appropriate when any exposure to control group policies is equivalent to continual exposure to control group policies.
  4. As the fraction of time experimental cases have been subject to control group policies and as the fraction of time control cases have been subject to welfare reform policies. This measure is most appropriate when the impact of welfare reform policies depends on the percentage of time cases are exposed to these policies.

The larger the crossover rate, the larger the difference between the corrected and uncorrected impact estimates.
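
The sketch below applies the simplest form of the Bloom correction, using crossover rates measured as in the first definition above and the illustrative rates from the text (0.05 and 0.15); the uncorrected impact value is a hypothetical number chosen for illustration.

# Minimal sketch of the Bloom correction, using the illustrative rates from the text.

def bloom_corrected_impact(uncorrected_impact, crossover_rate_experimental,
                           crossover_rate_control):
    """Divide the experimental-control difference in mean outcomes by one
    minus the sum of the crossover rates for experimental and control cases."""
    denominator = 1.0 - (crossover_rate_experimental + crossover_rate_control)
    if denominator <= 0:
        raise ValueError("Crossover rates are too high for the correction.")
    return uncorrected_impact / denominator

uncorrected = 100.0   # hypothetical impact estimate (e.g., dollars of earnings)
corrected = bloom_corrected_impact(uncorrected, 0.05, 0.15)
print(corrected)      # 125.0: the estimate is divided by 1 - (0.05 + 0.15) = 0.80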

A major advantage of the Bloom correction is that it can be calculated in a straightforward manner if original experimental/control status is known and if actual crossover behavior is measured accurately. For the Bloom correction to be used, it is not necessary to know whether noncrossover cases are crossover-type cases (that is, whether noncrossover cases would have been crossover cases if they had been assigned to the other experimental/control group).

A major disadvantage of the Bloom correction is that it relies on a restrictive assumption about the similarity of crossover-type cases and noncrossover-type cases. The Bloom correction assumes that, if there were no opportunity for crossover to occur, the impacts of welfare reform would not differ for crossover-type cases and noncrossover-type cases, after controlling for observed characteristics. If impacts would differ for crossover-type and noncrossover-type cases, the Bloom-corrected impact estimate will be biased. It is possible that the impact of welfare reform on crossover-type cases would be larger than the impact of welfare reform on noncrossover-type cases. In these instances, the Bloom-corrected impact estimate will understate the true impact of welfare reform, although not as much as the uncorrected impact estimate. If the impact of welfare reform on crossover-type cases would be smaller than the impact of welfare reform on noncrossover-type cases, the Bloom-corrected impact estimate will overstate the true impact of welfare reform, and the true impact will lie somewhere between the Bloom-corrected estimate and the uncorrected estimate.

Another disadvantage of using the Bloom correction is that the underlying statistical procedure tends to reduce the precision of impact estimates relative to estimates obtained using an indicator for original experimental/control status. In certain situations, this loss of precision may be substantial.

Corrections When Crossover Cases Are Excluded. When crossover and presumed crossover-type cases (all cases that migrate, merge, or split) are excluded from the research sample, two problems arise that may benefit from the use of statistical corrections. The first is that the exclusion of cases from the analysis sample may introduce sample selection bias, either because of differential crossover between experimental and control cases or because crossover is itself correlated with unmeasured determinants of outcomes. The second problem is that, even if impacts estimated for the restricted sample were unbiased, the restricted sample may not resemble the total sample of research cases and the impacts estimated for this sample may not be the same as the impacts for the full sample.

If exclusion of crossover cases and presumed crossover-type cases introduces sample selection bias, then sample selection correction procedures, such as the Heckman correction, may be employed. Proper use of these procedures requires that variables exist that influence crossover behavior but not the outcomes of interest. Such variables may be difficult to identify, since anything influencing the decision to migrate, merge, or split may also influence program participation decisions, employment, and earnings. As with the Bloom correction, sample selection corrections generally reduce the precision of impact estimates.
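
The following sketch illustrates the mechanics of a two-step selection correction in the spirit of Heckman (1979), under simulated data. The exclusion variable (here an arbitrary variable assumed to affect crossover-related attrition but not earnings) and all parameter values are assumptions for illustration, not features of any of the evaluations reviewed.

# Illustrative two-step selection correction in the spirit of Heckman (1979).
# All data are simulated; the exclusion variable z is an assumption.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 5000
experimental = rng.integers(0, 2, n)      # original experimental/control status
z = rng.normal(size=n)                    # assumed to affect attrition only
u = rng.normal(size=n)                    # unobserved factor (e.g., motivation)

# Differential attrition: experimental cases are more likely to be excluded.
retained = (0.5 + 0.8 * z - 0.4 * experimental + u > 0).astype(int)

# Simulated earnings outcome, correlated with the unobserved factor u.
earnings = 1000 + 300 * experimental + 400 * (0.5 * u + rng.normal(size=n))

# Step 1: probit of retention; compute the inverse Mills ratio.
W = sm.add_constant(np.column_stack([z, experimental]))
probit = sm.Probit(retained, W).fit(disp=False)
index = W @ probit.params
mills = norm.pdf(index) / norm.cdf(index)

# Step 2: outcome regression on retained cases, adding the Mills ratio term.
keep = retained == 1
X = sm.add_constant(np.column_stack([experimental[keep], mills[keep]]))
ols = sm.OLS(earnings[keep], X).fit()
print(ols.params[1])   # selection-corrected estimate of the experimental impact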

Even if we could assume that exclusion of crossover and crossover-type cases from the analysis sample did not generate bias in estimating the impacts on cases that remain, the resulting analysis sample may not be representative of the original research sample. To narrow the differences between these two samples, reweighting schemes may be employed to make the analysis sample more similar to the full research sample. Unfortunately, any reweighting scheme can make the samples resemble each other across only a limited number of observed dimensions. Even after reweighting the analysis sample, differences between the analysis sample and the entire research sample are likely to remain (for example, in the degree of mobility of the cases in each sample).
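
A minimal sketch of cell-based reweighting is shown below: weights are set to the ratio of each cell's share in the full research sample to its share in the reduced analysis sample, so that the weighted analysis sample matches the full sample across the chosen cells. The cell variables (site and applicant status) and the simulated attrition pattern are assumptions for illustration.

# Illustrative cell-based reweighting: weights make the reduced analysis sample
# match the full research sample across a few observed dimensions (assumed here).
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 10000
full = pd.DataFrame({
    "site": rng.choice(["A", "B", "C"], n),      # research site (assumed)
    "applicant": rng.integers(0, 2, n),          # applicant vs. recipient case
})

# Suppose crossover-related deletions occur at rates that vary by cell.
drop_prob = 0.10 + 0.10 * (full["site"] == "C") + 0.05 * full["applicant"]
analysis = full[rng.random(n) > drop_prob].copy()

cells = ["site", "applicant"]
full_shares = full.groupby(cells).size() / len(full)
analysis_shares = analysis.groupby(cells).size() / len(analysis)

# Weight = (cell share in full sample) / (cell share in analysis sample).
weights = (full_shares / analysis_shares).rename("weight").reset_index()
analysis = analysis.merge(weights, on=cells, how="left")

# Weighted cell shares in the analysis sample now match the full sample.
print(analysis.groupby(cells)["weight"].sum() / analysis["weight"].sum())
print(full_shares)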

2. State Approaches

The welfare reform waiver evaluations we studied had different approaches to including crossover cases in the analysis sample.

Wisconsin's WNW evaluation employed a comparison group design; the only crossover that could arise would involve migration out of a demonstration county. The demonstration counties chosen for the evaluation were small, so crossover to nonresearch counties (that is, counties without provisions such as time limits) was a major concern to the evaluator. The risk of crossover to the comparison counties was reduced somewhat by selecting noncontiguous counties for the evaluation, but crossover to other counties with pre-reform policies remained a concern. Efforts were being made to keep track of cases that had left a particular demonstration county for another location in the state.

For the California evaluations, the evaluator is attempting to track crossover cases throughout the state. AFDC and Medicaid participation information is available for the entire state, but AFDC benefit information is available for cases in the research counties only, which limits the ability to include crossover cases in certain impact analyses.

The Colorado evaluation excluded migrant crossover cases from the analysis, but only if they continued to receive welfare in nonresearch counties. If rates of migrant crossover or subsequent welfare receipt differed for experimental and control cases, then this practice would lead to biases in impact estimates. Merge/split crossover cases were always excluded from the Colorado analysis sample. To the extent that merge/split crossover rates differed for experimental and control cases, this would also lead to biases in impact estimates (although the incidence of merge/split crossover is typically small).

Michigan's evaluation deleted crossover cases and any other cases that left the research sites from the research sample. After processing data corresponding to the three years following the implementation of welfare reform, about one-fifth of the total research sample had been deleted for these reasons (in approximately equal percentages for experimental and control cases). To make the resulting sample more representative of the original research sample, a system of weights was developed that controlled for research site, recipient/applicant status, year of application, and number of adults in the case at baseline. All impact estimates in the fourth annual report (including third-year impacts for recipient cases) were estimated using these weights.

In Minnesota's evaluation, crossover cases were included in the analysis sample. Following standard MDRC procedures, impact estimates were generated using original experimental/control status, without employing the Bloom procedure or other corrections for crossover. The MDRC approach provides a precise lower-bound estimate of the impact of welfare reform on cases in the research sample. No analyses were reported showing the sensitivity of impact estimates to use of the Bloom correction or exclusion of actual crossover cases from the research sample.

3. Analysis and Recommendations

Welfare reform evaluators currently have several ways to approach crossover. For example, MDRC researchers generally include crossover cases in the sample and estimate impacts using an original experimental/control status variable, without any accompanying sensitivity analyses. Abt Associates, Inc., in its work on the Michigan evaluation, excluded all migrants and other crossover cases from the research sample and employed a weighting scheme to make the reduced sample representative of the original research sample.

We recommend that evaluators studying the impacts of a welfare reform initiative include crossover cases in the analysis sample whenever the available outcome data permit. This approach minimizes sample selection bias and avoids the need for reweighting. Including crossover cases in the research sample also makes it unnecessary for the evaluator to identify presumed crossover-type cases for systematic deletion from the sample. By not deleting crossover cases, original experimental cases remain comparable with original control cases. Because cases that subsequently cross over remain in the research sample, impact estimates over time will apply to the same sample of cases.

In addition to including crossover cases in the analysis sample, it is important that evaluators identify the extent to which crossover behavior occurs for the experimental and control groups. As noted earlier, there are at least four ways of measuring crossover. The preferred measure will depend on the nature of the intervention being evaluated. Evaluators need to be clear about how they define crossover; they may also want to consider the sensitivity of crossover rate estimates to the way in which crossover is defined.

As long as the incidence of crossover is low, any sort of statistical correction should generate impact estimates similar to the uncorrected impact estimates. However, if the incidence of crossover is high, it is important that welfare reform evaluations provide sufficient information to compare results obtained using different approaches and methodologies.

When generating estimates of the impacts of welfare reform, we recommend that the primary focus be on impacts estimated using the original experimental/control status of research cases (with regression adjustments applied to increase the precision of these estimates). This approach always provides a lower bound on the true impact. In contrast, the Bloom procedure is less precise and provides downwardly biased estimates if impacts are greater for crossover-type cases than noncrossover-type cases, and upwardly biased estimates if impacts are greater for noncrossover-type cases than crossover-type cases. Knowing the Bloom-corrected impact estimate for each outcome may still be valuable for sensitivity analyses, so we recommend that these estimates be included in appendixes to the impact study reports.

D. ESTIMATING IMPACTS FOR SUBGROUPS DEFINED BY EVENTS SINCE RANDOM ASSIGNMENT

No matter how carefully an evaluator assembles administrative or survey data for an impact analysis, outcome or background data will be missing or invalid for at least a small fraction of the sample. Frequently, however, outcome or background data are missing or invalid for a large portion of the research sample (perhaps 20 percent or more). One reason for this situation is that the data were not collected completely by administrators or survey workers, although the corresponding information is (in theory) available for all relevant cases. Another reason for missing or invalid data is that the outcome itself is defined by program-related events subsequent to random assignment (such as participation in welfare or in job training), so no outcome data can possibly be collected for some cases. Both situations present problems for the analysis of impacts from the welfare reform package.


1. Issues

This section discusses two issues related to the use of samples defined by events since random assignment:

  1. Should impacts be estimated when administrative or survey data are incomplete or clearly incorrect for a large fraction (one-fifth or more) of the sample?
  2. Should impacts also be estimated for subgroups defined by employment or program participation decisions since random assignment? If so, how?
a. Estimating Impacts When Data Are Incomplete or Incorrect

When considering whether to omit observations from the sample because of incomplete or clearly incorrect administrative and survey data, it is important to distinguish between (1) background information and (2) outcomes data.

Missing or Invalid Background Information. Background information may be omitted for certain cases because of omitted or incorrect values in administrative records or because of survey nonresponse, item nonresponse, or invalid responses to baseline surveys.

The presence of certain background information is essential for an observation to be included in the analysis sample. Welfare reform evaluations generally distinguish impacts for experimental and control cases in recipient and applicant samples, so the presence of an original experimental/control status variable and an applicant/recipient variable is essential for every observation. To construct certain outcome variables (such as employment and earnings), a valid Social Security number is usually required for matching with UI wage records.

In other instances, observations may be included in impact analyses even if background information is incomplete. For instance, information on the demographic characteristics of a household is valuable for the construction of descriptive statistics and for increasing the precision of impact estimates. However, such information is not essential for impact estimates. Regression-adjusted means can still be calculated by imputing the missing background information, or by setting the missing information equal to a default value and adding indicators for the missing values, without introducing bias. Excluding a large number of cases with missing background information risks making the analysis sample less representative of the entire research sample, since particular types of recipient or applicant cases might be less likely to provide valid background information.
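
The sketch below illustrates the missing-indicator approach described above, using simulated data: a missing background variable (age) is set to a default value, an indicator for missingness is added, and the regression-adjusted impact is estimated without dropping any cases. The variable names and parameter values are assumptions for illustration.

# Sketch of the missing-indicator approach: set missing background values to a
# default and add an indicator, so no cases are dropped. Simulated data only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 4000
experimental = rng.integers(0, 2, n)
age = rng.normal(30, 8, n)
earnings = 800 + 250 * experimental + 15 * age + rng.normal(0, 500, n)

# Make roughly a quarter of the age values missing.
age_observed = np.where(rng.random(n) < 0.25, np.nan, age)

df = pd.DataFrame({"experimental": experimental, "age": age_observed,
                   "earnings": earnings})
df["age_missing"] = df["age"].isna().astype(int)   # indicator for missing age
df["age_filled"] = df["age"].fillna(0)             # default value for missing age

X = sm.add_constant(df[["experimental", "age_filled", "age_missing"]])
fit = sm.OLS(df["earnings"], X).fit()
print(fit.params["experimental"])   # regression-adjusted impact estimate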

Missing or Invalid Outcomes Data. Outcome information may be omitted for certain cases because of omitted or incorrect values in administrative records or because of survey nonresponse, item nonresponse, or invalid responses to client surveys.

The presence of outcome information is usually essential for obtaining impact estimates. Imputing missing values of nonessential background variables is unlikely to bias impact estimates. Imputing values of the outcome variables themselves is more questionable, however, since it assumes that the relationship between background information and outcomes is the same for cases with missing information as for cases with nonmissing information.

If observations with missing outcomes data are excluded from impact analyses, then biased impact estimates may result if the incidence of missing outcomes data differs for experimental and control cases, or if observations with missing outcomes data differ from other observations in some systematic way correlated with the outcome variables. In these situations, use of a sample selection procedure may be possible, provided that at least some background information is available for the cases with missing outcome information and that a background variable can be identified that is correlated with the absence of outcomes data but is not correlated with the outcomes themselves. Assuming such a variable can be identified (which is not certain), correcting for a possible sample selection bias in impact estimates must still be balanced against the loss of precision in impact estimates that such corrections entail.

b. Estimating Impacts for Subgroups Defined by Decisions Since Random Assignment

Even if administrative records and survey data are complete, certain outcomes will only be available for a subset of the research sample that is defined by behavior or events since random assignment. For example, when estimating welfare recidivism rates, the sample must be limited to cases that had left welfare within the follow-up period. This sample is only a part of the entire sample and, if experimental policies induce a different pattern of exits from welfare, the baseline characteristics of experimental and control cases in the subsample will differ. Another example of an analysis using a subsample defined by behavior after random assignment would be an analysis of JOBS participation rates among current welfare recipients.

A random-assignment design may be helpful in dealing with certain problems related to the use of such samples. In particular, experimental status variables may be useful in correcting for sample selection because of decisions since random assignment. For example, if a study sought to estimate labor market outcomes for JOBS participants, experimental status could be used in a sample selection procedure as a predictor of whether particular individuals would participate in JOBS.(5)
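
As one simple illustration of the approach referenced in footnote 5, the sketch below computes a Wald (instrumental variables) estimate of the effect of JOBS participation, using original experimental status as the instrument; the data are simulated, and the participation and earnings parameters are assumptions.

# Illustrative Wald (instrumental variables) estimate of the effect of JOBS
# participation, using original experimental status as the instrument.
# Simulated data; participation and earnings parameters are assumptions.
import numpy as np

rng = np.random.default_rng(5)
n = 8000
experimental = rng.integers(0, 2, n)

# Assignment to the experimental group raises the chance of JOBS participation.
participates = (rng.random(n) < 0.30 + 0.25 * experimental).astype(int)

# Earnings depend on participation (true effect of 400 assumed) plus noise.
earnings = 900 + 400 * participates + rng.normal(0, 700, n)

# Intent-to-treat effects of assignment on the outcome and on participation.
itt_outcome = earnings[experimental == 1].mean() - earnings[experimental == 0].mean()
itt_participation = (participates[experimental == 1].mean()
                     - participates[experimental == 0].mean())

# Dividing the two gives the simplest two-stage least squares estimate of the
# effect of participation itself.
print(itt_outcome / itt_participation)   # should be near 400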

2. State Approaches

In general, the five evaluations we reviewed handled instances of missing data in the same manner: observations with missing outcomes or background data were not included in the impact analyses. The loss of observations from missing data appears to have been small in most cases. For example, Minnesota's evaluation reported that only one percent of the research sample failed to complete a background information form, and less than five percent of the research sample was excluded from the six-month impact analyses because of missing welfare participation data. The Michigan and Minnesota evaluations proposed using sample selection procedures to adjust impact estimates in instances in which a large portion of the sample contained missing data; in practice, however, these procedures were not employed because the portions of the sample with missing data were small.

For the four random-assignment evaluations, subgroups usually were defined on the basis of baseline characteristics rather than events since random assignment. In certain instances, however, deletions from the sample may have reduced the strict comparability of the experimental and control groups. In Michigan's evaluation, for example, cases active for only one month were deleted from the research sample, reducing the size of the sample by between two and three percent after two years of data had been processed. The evaluator justified this decision because "a one-month eligibility period for AFDC or SFA is somewhat unusual and therefore suspect . . . we assumed that cases active only one month would have left with or without exposure to TSMF and should not be considered part of the demonstration."(6) As noted earlier, denied applicants for both AFDC and SFA were excluded from the Michigan sample because of data limitations; the evaluator argued that there was no evidence that these deletions introduced "important intergroup differences in baseline characteristics" of experimental and control cases.(7)

In one instance, an evaluator reported outcomes for subgroups defined by events since random assignment, but with an important qualification. In its report on six-month impacts from Minnesota's MFIP initiative, MDRC reported outcomes such as welfare benefits of active cases and earnings of employed single parents on welfare. Mean values were distinguished for experimental and control cases, but without any tests of statistical significance of experimental-control differences. A comment in the text noted that "the subset of the MFIP group for whom averages are shown may not be comparable to the subset of the AFDC group for whom averages are shown. Thus, effects of MFIP cannot be inferred from this table."(8)

In Wisconsin, certain outcomes are being collected only for cases that leave AFDC, but the evaluator has not yet decided on procedures for correcting for selection into this sample. Because the WNW evaluation design is nonexperimental, the evaluator is giving more attention to collecting data that could be useful in modeling participation in particular welfare reform programs.

3. Analysis and Recommendations

Properly implemented, random assignment ensures that the baseline characteristics of experimental and control cases are, on average, the same. A major advantage of this equivalence at baseline is that subsequent differences between experimental and control groups can be attributed entirely to exposure to welfare reform policies. When the sample used to analyze impacts from welfare reform is reduced in size, either because of incomplete or incorrect data or because of the analysis of subgroups defined by program-related events since random assignment, the strict comparability of the experimental and control samples may be lost.

The problem of incomplete or incorrect data can be reduced through state efforts to ensure the quality of administrative records and through evaluator efforts to increase survey response rates. We also recommend that evaluators use all observations for which valid outcome data and basic baseline characteristics are present, rather than restricting the sample because nonessential baseline information is missing. If desired, evaluators may impute missing values of nonessential baseline characteristics. By not deleting observations needlessly, both large sample sizes and the representativeness of the overall research sample are preserved. This makes the resulting impact estimates more applicable to the entire population of cases from which the research sample was drawn.

In defining outcomes for inclusion in the impact study, evaluators of state welfare reform programs should adopt analytic strategies that take maximum advantage of the strengths of an experimental design. In particular, we recommend that evaluators define outcomes in ways that enable values to be assigned for all or nearly all recipient and applicant cases in the research sample. When there is interest in a particular outcome for a subgroup defined by events since random assignment (such as recidivism rates for cases that have left welfare), we recommend that alternative outcomes be considered for analysis (such as the number of welfare spells or the percentage of months spent on welfare since random assignment).

Notes

(1)The use of sample selection correction procedures such as the Heckman correction (Heckman 1979) can account for cases' participation decisions. Estimates obtained using these procedures may be sensitive to underlying statistical assumptions, however, and usually are less precise than ordinary least squares estimates.

(2)Sometimes, however, there will be no such correspondence between entry and exit effects, since certain policies (for example, diversion payments or some AFDC-UP expansions) will not apply to current welfare recipients but only to new applicants.

(3)The study of entry effects is separate from the rest of the evaluation. It is being conducted by Professor Michael Wiseman at the University of Wisconsin.

(4)"Manpower Demonstration Research Corporation (1994). Proposal Design and Workplan for Evaluating the Minnesota Family Investment Program. New York: MDRC, p. 28.

(5)If the study also wanted to look at the direct effect of JOBS participation on labor market outcomes, experimental status could be used in a two-stage least squares procedure to predict JOBS participation.

(6)"Werner, Alan, and Robert Kornfeld (1996). "The Evaluation of To Strengthen Michigan Families: Fourth Annual Report: Third Year Impacts." Cambridge, MA: Abt Associates, Inc., p. B-1.

(7)Werner and Kornfeld (1996), p. A-4.

(8)"Knox, Virginia, et al. (1995). "MFIP: An Early Report on Minnesota's Approach to Welfare Reform." New York: MDRC, pp. 4-9.

References

Bloom, Howard S. "Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs." Evaluation Review, vol. 19, no. 5, October 1995, pp. 547-556.

Bloom, Howard S. "Accounting for No-Shows in Experimental Evaluation Designs." Evaluation Review, vol. 8, 1984, pp. 225-246.

Bloom, Dan, and David Butler. "Implementing Time-Limited Welfare: Early Experiences in Three States." New York: Manpower Demonstration Research Corporation, November 1995.

Bloom, Howard S., Larry L. Orr, Stephen H. Bell, and Fred Doolittle. "The National JTPA Study: Title IIA Impacts on Earnings and Employment at 18 Months." Bethesda, MD: Abt Associates, Inc., 1993.

Burghardt, John, Walter Corson, John Homrighausen, and Charles Metcalf. "A Design for the Evaluation of the Impacts of Job Corps Components." Report submitted to the U.S. Department of Labor, Employment and Training Administration. Princeton, NJ: Mathematica Policy Research, Inc., October 1985.

Burtless, Gary. "The Case for Randomized Field Trials in Economic and Policy Research." Journal of Economic Perspectives, vol. 9, no. 2, spring 1995, pp. 63-84.

Burtless, Gary, and Jerry A. Hausman. "The Effect of Taxation on Labor Supply: Evaluating the Gary NIT Experiment." Journal of Political Economy, vol. 86, no. 6, December 1978, pp. 1103-1130.

Cohen, Jacob. Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press, 1977.

Fraker, Thomas, and Rebecca Maynard. "The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs." Journal of Human Resources, vol. 22, no. 2, spring 1987, pp. 194-227.

Friedlander, Daniel, and Philip K. Robins. "Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods." American Economic Review, vol. 85, no. 4, September 1995, pp. 923-937.

Greenberg, David, and Mark Shroder. Digest of the Social Experiments. Madison, WI: Institute for Research on Poverty, University of Wisconsin, 1991.

Gueron, Judith M., and Edward Pauly. From Welfare to Work. New York: Russell Sage Foundation, 1991.

Heckman, James J. "Sample Selection Bias as Specification Error." Econometrica, vol. 47, January 1979, pp. 153-161.

Heckman, James J., and V. Joseph Hotz. "Choosing Among Alternative Nonexperimental Methods for Estimating Impact of Social Experiments: The Case of Manpower Training." Journal of the American Statistical Association, vol. 84, no. 408, December 1992, pp. 262-80.

Heckman, James J., and Jeffrey A. Smith. "Assessing the Case for Social Experiments." Journal of Economic Perspectives, vol. 9, no. 2, spring 1995, pp. 85-110.

Keeley, Michael C., Philip K. Robins, Robert G. Spiegelman, and Richard W. West. "The Estimation of Labor Supply Models Using Experimental Data." American Economic Review, vol. 68, no. 5, December 1978, pp. 873-887.

LaLonde, Robert. "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." American Economic Review, vol. 76, no. 4, September 1986, pp. 604-620.

Moffitt, Robert A. "The Effect of Work and Training Programs on Entry and Exit from the Welfare Caseload." Institute for Research on Poverty discussion paper no. 1025-93. Madison, WI: Institute for Research on Poverty, November 1993.

Moffitt, Robert. "Evaluation Methods for Program Entry Effects." In Evaluating Welfare and Training Programs, edited by Charles F. Manski and Irwin Garfinkel. Cambridge, MA: Harvard University Press, 1992, pp. 231-252.

Rangarajan, Anu, John Burghardt, and Anne Gordon. "Evaluation of the Minority Female Single Parent Demonstration: Volume II: Technical Supplement to the Analysis of Economic Impacts." Report submitted to the Rockefeller Foundation. Princeton, NJ: Mathematica Policy Research, Inc., October 1992.
