An outcome-based performance measurement system consists of four major elements: goals, measures, standards, and consequences.
Engaging a broad range of stakeholders in structured discussions about all of these elements can be the basis for building performance partnerships in designing and implementing a performance measurement process. Even if one is not trying to achieve consensus, this wider audience provides a better perspective on what is possible in the real world (e.g., what is operationally feasible and what unintended consequences are possible) and may facilitate future data collection efforts (Hatry, 1999). Based on HHS's experiences in developing the TANF High Performance Bonus guidance and rulemaking and reports from states that have developed outcome-based performance measurement systems, holding inclusive discussions appears to be the preferred approach. Some states have used the process of developing results-based accountability systems to engage the wider public in a discussion about statewide policy goals and priorities (either within or across specific programs) and to build a commitment of public, private, and nonprofit resources toward these ends.
Developing an outcome-based performance system starts with identifying the goals of the program. What is the purpose of a program? What are its desired outcomes? How will we know if the program is working? This is not necessarily an easy task for a program with as many purposes and as much flexibility in the possible uses of funds as TANF - it may be difficult to reach consensus on the outcomes that we care about enough to single out for performance measures, particularly if the associated consequences are substantial.
As discussed above, under TANF, states have a great deal of flexibility in selecting how to distribute their funds among the four Congressionally-specified purposes. While all states have invested substantially in promoting work and self-sufficiency, beyond this common core, they have made different choices in the goals they promote: some have invested in programs to support the formation and maintenance of two-parent families, some have focused on preventing teen pregnancies, and others have expanded their supports for all working poor families, whether or not they have previously received cash assistance. Moreover, even within the general area of promoting work, states have made different decisions about how best to achieve this goal. For example, some states have adopted "work-first" programs which encourage recipients to accept any job they can get in order to acquire work experience, while others have encouraged recipients to be more selective and to participate in training to qualify for a job that offers some promotion potential, or that provides health insurance or other benefits. Which of these approaches will be determined to have the best outcomes depends at least in part on the specific measure that is selected.
The question of whether this degree of variation in goals is appropriate and desirable or whether state flexibility in this area should be restricted is a topic for TANF reauthorization discussions. There are any number of potential approaches, including:
Once there is agreement upon the goals of a program, the next step is to develop specific measures that reflect these goals. At this stage, both operational and theoretical concerns must be taken into account.
The availability of data is the primary operational concern. Overall goals must be linked to specific measures for which accurate and comparable data are available at the state level in a timely fashion and at a reasonable cost (Brown and Corbett, 1997; Hatry, 1999; Yates, 1997; Zornitsky and Rubin, 1988). Comparability of data means that the measures should detect real differences in program performance across states or localities or over time rather than reflect differences in the quality of data used to calculate the measures. Timeliness of data has not always been considered in the selection of measures, but it is necessary if the performance measurement system is expected to provide policy-relevant feedback to states on the results of their actions. At the July 1999 consultation, states expressed a clear preference for minimizing the cost and burden of data collection: they favored measures that could be assessed using national survey data or data they were already reporting over measures that would increase their data collection and reporting responsibilities.
Experience to date has shown that data systems at all levels of government fall short of the ideal (Brown and Corbett, 1997). Data for some measures cannot be collected at all, while data for others can be collected only with poor quality. Moreover, the cost of developing or improving data collection systems can be substantial. While states are collecting a range of information about TANF recipients beyond that required under the federal reporting rules, they do not all collect the same data elements. Even when the same general information is collected, there is no consistency in how it is measured across states (APHSA, 2000).
Some performance measurement systems have had success with existing administrative data - such as Unemployment Insurance (UI) records - which are collected uniformly across a range of states or localities (Bartik, 1996; Yates, 1997). Administrative data usually attain some level of quality and the cost of collecting the data is limited since they are generally collected for other purposes. However, the types of measures that can be derived from administrative data are limited. For example, UI records include data on earnings over a quarter, but not on hourly wages. Moreover, the quality of administrative data is highest for data elements that are directly related to the purpose for which the data were collected - such as the amount of benefits paid - and lower for other elements - such as the educational level of recipients (GAO, 1997). National or state surveys can also provide data on a wider range of measures, particularly if existing survey efforts can meet the needs of the performance measurement system. However, initiating survey efforts can be relatively expensive. If one is interested in outcomes that are valid at the level of specific states or localities, relatively large sample sizes will be required to achieve this level of precision. Appendix D of this report reviews the merits and disadvantages of several potential data sources for outcome measures.
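The point about sample sizes can be illustrated with a rough calculation. The sketch below assumes simple random sampling and a normal approximation for estimating a proportion (such as an employment rate); the function name and the chosen margins of error are illustrative assumptions, not figures from this report.

```python
import math

def required_sample_size(margin_of_error, proportion=0.5, z=1.96):
    """Respondents needed to estimate a proportion within +/- margin_of_error.

    Assumes simple random sampling and a normal approximation.
    z=1.96 corresponds to roughly 95 percent confidence, and
    proportion=0.5 is the most conservative (largest-sample) choice.
    """
    n = (z ** 2) * proportion * (1 - proportion) / (margin_of_error ** 2)
    return math.ceil(n)

# A single national estimate within 3 percentage points.
national_only = required_sample_size(0.03)

# The same precision in each of 50 states needs that sample 50 times over.
all_states = 50 * required_sample_size(0.03)

print(national_only, all_states)
```

Under these assumptions, the sample needed for state-level estimates is roughly 50 times the sample needed for a single national estimate of the same precision, which is why new survey efforts can be expensive.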
The theoretical concerns in developing measures are driven by the fact, as noted earlier, that all high-level outcome measures are affected by a range of factors, not just by program performance. This would not be a problem if there were a strong correlation between performance on an outcome measure and program effectiveness, as shown through evaluation. Unfortunately, research has shown that there is not always a consistent relationship between program outcomes and impacts.(2) For example, many welfare recipients find jobs on their own - without the assistance of welfare-to-work programs. The role of welfare-to-work programs is to add value to the "natural" movement off welfare and into employment. States with stronger economies and lower unemployment rates are generally able to move more individuals into employment than those with weaker economies. Similarly, states with a more disadvantaged caseload may have greater difficulty moving individuals into work than states with a more job-ready caseload. Therefore, differences in economic conditions or in caseload composition, rather than in welfare-to-work program effectiveness, may have more to do with performance on an outcome measure.
Appendix A examines this issue in more detail. Using data from random-assignment evaluations of welfare-to-work programs in five sites, it can be seen that there is not a consistent relationship between the programs with the highest employment rates or average earnings - two possible outcome measures - and the programs that produced the greatest impact on these measures. This problem is one of the major issues identified in the 1994 report to Congress (HHS, 1994) that needs to be resolved in order to adopt an outcome-based performance measurement system. However, the research also shows that this problem is not unique to outcome measures - participation rates over time also are poorly correlated with program impacts.
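The distinction between outcomes and impacts can be made concrete with a small sketch. The site names and employment rates below are hypothetical and are not taken from the evaluations discussed in Appendix A; they only show how the program with the best outcome need not be the program with the greatest impact.

```python
# Hypothetical employment rates (percent) for two random-assignment sites.
programs = {
    "Site A": {"program_group": 60.0, "control_group": 57.0},
    "Site B": {"program_group": 45.0, "control_group": 35.0},
}

def impact(rates):
    """Impact = program-group outcome minus control-group outcome."""
    return rates["program_group"] - rates["control_group"]

# Ranking on the raw outcome favors Site A ...
best_by_outcome = max(programs, key=lambda s: programs[s]["program_group"])

# ... while ranking on impact favors Site B.
best_by_impact = max(programs, key=lambda s: impact(programs[s]))

print(best_by_outcome, best_by_impact)
```

In this illustration, Site A has the higher employment rate, but most of that employment would have occurred without the program; Site B adds more value.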
Several lessons can be drawn from this research:
The selection of specific measures inevitably involves trade-offs. The use of multiple measures can help guard against any unintended consequences that might be caused by reliance solely on a single measure. However, it is important not to err by going too far in the other direction - a relatively complex system can have a less immediate effect on motivating programs in any particular direction (Bartik, 1996). It is also important not to lose sight of the program goals and desired outcomes: the measures that have been chosen must reflect the initial choice of goals.
Part III of this report includes a detailed examination of several potential measures that could be used to assess the performance of state Temporary Assistance for Needy Families (TANF) programs. These include the measures that have been selected for the TANF High Performance Bonus.
Standards identify expected levels of performance and provide the basis for assessing whether states are achieving program goals and, therefore, should be rewarded or penalized. The standards included in an outcome-based performance measurement system should be challenging yet achievable. Standards appear to be most likely to affect states at the margin - those which are in danger of being penalized or within striking range of receiving a bonus. In a study of JTPA programs, for example, Dickinson and West (1988) found that about 42 percent of the local operating entities they studied tried to maximize their measured performance, one-fourth tried only to slightly exceed their standards, and about one-third tried merely to meet their standards in order to avoid program sanctions. If a standard is set too low, it loses its effectiveness as an incentive for states to improve their performance. If it is set too high, states are likely to be put off by the unreasonable standard. Depending on the consequences for failure, states are likely to either simply give up trying to achieve the unreasonable standard or look for ways to get around it. For example, it appears that the high participation rate requirements for two-parent families on TANF have caused several states to change the way they provide assistance to such families by using state maintenance of effort (MOE) dollars rather than federal TANF funds.(3)
It is extremely difficult to determine an appropriate standard without baseline data on past performance. When data for a specific measure have never been collected or analyzed before, neither state nor federal policymakers are likely to know what would be a reasonable level of performance. In developing the TANF High Performance Bonus, HHS dealt with this issue by rewarding the top states in each category, rather than by establishing a fixed standard. It is still too early to tell whether this approach of rewarding the top performers will motivate the broad middle range of states to improve their performance. One encouraging sign, however, is that in the first year of the High Performance Bonus, a wide range of states (46) elected to submit data to compete for a bonus on one or more of the four measures.
Using the national average as the standard is another method, as is the case with the Quality Control system for the Food Stamp Program. At the consultation, state representatives expressed opposition to this approach. They objected to not knowing their performance target up front, and to the possibility that a state could find itself penalty-liable in a given year without experiencing any change in its performance, due simply to changes in other states' performance. (The same concerns would apply to a system that penalized the bottom n performers under a measure.)
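The states' objection to a moving target can be sketched as follows. The measure (where a higher rate is better), the state names, and the rates are all hypothetical; the standard is taken to be the simple unweighted average across states, which is an assumption made for illustration.

```python
def below_national_average(rates, state):
    """True if the state's rate falls below the unweighted average of all states."""
    average = sum(rates.values()) / len(rates)
    return rates[state] < average

# Hypothetical rates on a measure where higher is better.
year_1 = {"State A": 50.0, "State B": 40.0, "State C": 45.0}
year_2 = {"State A": 55.0, "State B": 46.0, "State C": 45.0}

# State C's performance is unchanged, yet its status changes.
liable_year_1 = below_national_average(year_1, "State C")
liable_year_2 = below_national_average(year_2, "State C")

print(liable_year_1, liable_year_2)
```

State C meets the standard in the first year and falls below it in the second solely because the other states improved.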
One important issue is whether to establish a single nationwide performance standard for each measure or to adjust standards to account for differences in economic and demographic circumstances among states. In the past, different federal programs have chosen different options. Within TANF, we have examples of both absolute standards (the state work participation rates) and negotiated standards (the participation rates under the Tribal TANF program). JTPA used a regression model that took into account economic and demographic factors to adjust its performance standards for each state and for local areas. WIA provides for negotiated performance standards at both the state and sub-state level. Elements that must be considered in the negotiations process include: how the standards compare to other areas, taking into account economic and demographic factors and program design; the extent to which the standards promote continuous performance improvement; and the extent to which the standards assist the program in achieving a high level of customer satisfaction. (The use of outcome-based performance measures in these and other welfare and workforce development programs is discussed in more detail in Appendix A.)
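The idea behind a regression-adjusted standard can be sketched with a single adjustment factor. This is not the JTPA model, which took several economic and demographic factors into account; the one-variable fit, the function names, and all of the data below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y on a single factor x.

    Returns the slope and the mean of x.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx, mean_x

def adjusted_standard(national_standard, slope, mean_x, local_x):
    """Shift the national standard by the predicted effect of local conditions."""
    return national_standard + slope * (local_x - mean_x)

# Hypothetical data: unemployment rate and employment outcome (percent) by area.
unemployment = [3.0, 4.0, 5.0, 6.0, 7.0]
employment_rate = [60.0, 56.0, 52.0, 48.0, 44.0]

slope, mean_unemployment = fit_line(unemployment, employment_rate)

# An area with 7 percent unemployment gets a lower standard than one with 3 percent.
weak_economy = adjusted_standard(50.0, slope, mean_unemployment, 7.0)
strong_economy = adjusted_standard(50.0, slope, mean_unemployment, 3.0)

print(weak_economy, strong_economy)
```

The adjustment holds each area accountable for performance relative to what its conditions predict, rather than relative to a single nationwide figure.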
One of the concerns that has been raised about modifying standards to reflect differences in demographic conditions is that it reduces the incentive for states to provide appropriate services to those populations identified as "hard-to-serve." The TANF program takes a unique approach to this issue with respect to the domestic violence hardship exemption. Section 408(a)(7)(C) of the Social Security Act, as amended by PRWORA, permits states to exempt victims of domestic violence from the time limit and, under regulations implementing that provision, from the work requirements. Individuals receiving an exemption from work participation rates or the time limit due to domestic violence are not removed from the initial calculations. However, if a state fails to meet the work participation rate requirements or exceeds the cap on time limit extensions, and can show that this failure is due to provision of good cause domestic violence waivers, HHS may grant reasonable cause relief from the penalties. States may only receive this relief if they have adopted the Family Violence Option and are providing appropriate services to individuals granted waivers. To date, no state has needed this relief.
Under the current participation rate requirements, similar relief is not provided to states that fail to meet the standards due to exemptions provided to individuals with other barriers to employment, such as mental health issues or substance abuse. In particular, states do not receive credit for engaging recipients in appropriate services that are not among the list of specific countable work-related activities.
A different approach, which does not directly adjust for economic and demographic conditions but has some of the same effects, is to reward states for improvements rather than (or in addition to) absolute levels of performance. This approach was taken by HHS in developing the TANF High Performance Bonus measures. This gives states that have performed poorly in the past a strong incentive to improve, even if they are unlikely to achieve results that place them in the ranks of higher-performing states. Moreover, since demographic conditions do not change very much from year to year, improvements are likely to be caused by changes in program operations rather than by underlying conditions.
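The difference between rewarding levels and rewarding improvement can be illustrated with a short sketch. The states and rates are hypothetical, and the simple year-over-year difference used here is an assumption; it is not the formula used for the High Performance Bonus.

```python
# Hypothetical (prior year, current year) rates on a measure where higher is better.
rates = {
    "State A": (30.0, 38.0),
    "State B": (55.0, 56.0),
    "State C": (45.0, 44.0),
}

# Rank by the current-year level of performance.
by_level = sorted(rates, key=lambda s: rates[s][1], reverse=True)

# Rank by the change from the prior year.
by_improvement = sorted(rates, key=lambda s: rates[s][1] - rates[s][0], reverse=True)

print(by_level)
print(by_improvement)
```

State A ranks last on absolute performance but first on improvement, so an improvement-based reward gives it an incentive that a level-based reward would not.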
While states participating in the consultation were generally receptive to the notion of basing some bonuses on improvement, some expressed concern about standards that have incremental increases each year. A few states that began their welfare reform efforts early on, under waiver policies, felt that they were approaching the maximum realistic levels of work participation and should not be penalized if they did not continue to improve.
States also expressed a great deal of concern at the consultation about rigid thresholds for penalties that create "cliffs" in which a small difference in outcomes could result in the imposition of large penalties. This is a particular concern where the data are believed to be "noisy" and error-prone. The current TANF regulations illustrate one way in which such threshold effects can be minimized - the amount of the penalty assessed for failure to achieve the minimum participation rate requirements is proportional to the degree of the failure. However, there is still a cliff under the statute because a state's failure to meet the participation rate, by whatever margin, results in its "maintenance of effort" (MOE) funding requirement increasing from 75 percent to 80 percent.
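The contrast between a proportional penalty and a cliff can be sketched as follows. The 75 percent and 80 percent maintenance-of-effort levels come from the text above; the penalty formula, the maximum penalty amount, and the participation rates are simplified, hypothetical stand-ins and do not reproduce the actual TANF regulations.

```python
def proportional_penalty(achieved, required, max_penalty):
    """Penalty scaled by the relative shortfall (a simplified, hypothetical rule)."""
    if achieved >= required:
        return 0.0
    shortfall = (required - achieved) / required
    return max_penalty * shortfall

def moe_requirement(achieved, required):
    """MOE share: 75 percent if the rate is met, 80 percent otherwise (the cliff)."""
    return 0.75 if achieved >= required else 0.80

# A state that misses a hypothetical 50 percent requirement by one point.
near_miss_penalty = proportional_penalty(49.0, 50.0, 1_000_000)
near_miss_moe = moe_requirement(49.0, 50.0)

# A state that misses by 25 points.
large_miss_penalty = proportional_penalty(25.0, 50.0, 1_000_000)
large_miss_moe = moe_requirement(25.0, 50.0)

print(near_miss_penalty, near_miss_moe)
print(large_miss_penalty, large_miss_moe)
```

Under the proportional rule, the near miss costs far less than the large miss, but both states face the same jump in their MOE requirement.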
Another important issue to consider in designing a performance measurement system is the consequences of meeting - or failing to meet - the established standards or performance targets. While the question has not been formally studied, it is reasonable to assume that the greater the dollar amount of the penalty or bonus, the greater the incentive or deterrent effect. Determining the optimal amount is a challenge. If a limited pool of bonus funds is divided among a large number of measures, all with significant weights, the incentive to perform well on any one measure is likely to be eroded. When a bonus is set at a fixed amount, regardless of the size of the state's basic grant, as was the case for the TANF bonus for reductions in out-of-wedlock births, it is likely to have more of an effect on states with smaller grants, for whom the bonus could be quite large in relation to their grant amount. Because of this consideration, the High Performance Bonus awards were allocated to the top performing states in amounts proportional to their TANF block grants. A third scenario is that the penalty is too large to be viable.
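The effect of fixed versus proportional bonus amounts can be shown with a brief sketch. The grant sizes and bonus pool are hypothetical, and the allocation function is an illustrative assumption rather than the actual High Performance Bonus formula.

```python
def allocate_proportionally(pool, grants):
    """Split a bonus pool among winning states in proportion to their block grants."""
    total = sum(grants.values())
    return {state: pool * grant / total for state, grant in grants.items()}

# Hypothetical block grants for two winning states, in millions of dollars.
grants = {"Small State": 10.0, "Large State": 90.0}
pool = 10.0

# Fixed bonus: each winner receives the same dollar amount.
fixed = {state: pool / len(grants) for state in grants}
fixed_share = {state: fixed[state] / grants[state] for state in grants}

# Proportional bonus: each winner receives the same share of its own grant.
proportional = allocate_proportionally(pool, grants)
proportional_share = {state: proportional[state] / grants[state] for state in grants}

print(fixed_share)
print(proportional_share)
```

With a fixed amount, the small state's bonus equals half of its grant while the large state's equals under 6 percent; with proportional allocation, both receive 10 percent of their grants.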
It is not clear, in fact, whether it is necessary to attach any financial consequences to an outcome-based performance measurement system. Some have argued that the honor of being singled out as a high performer - or particularly the stigma of being singled out as a poor performer - may be a powerful enough incentive on its own. For example, a substantial amount of attention is paid to the annual Kids Count Data Book, which reports state performance on a wide range of indicators of child well-being. There are also political consequences for states associated with being found penalty-liable or selected for a bonus, regardless of the dollar amount (Dickinson and West, 1988).
State feedback at our consultation suggested that the threat of being penalized was very salient, regardless of the amount of the penalty or whether it was ultimately possible to avoid the penalty through a corrective compliance process. In support of this argument, they noted that attempts to enforce financial penalties in the past have inevitably resulted in expensive and time-consuming administrative and judicial appeals, which have long delayed, if not negated, any actual transfer of funds. Penalties appear to have greater political consequences than bonuses, possibly because of the negative publicity and the great difficulty in finding the funds needed to replace the funds lost as a result of the penalty. (Under TANF, states that are subject to a penalty must replace the withheld funds with "state-only" funds which do not count toward satisfying the maintenance-of-effort requirement.)
There are some circumstances under which financial incentives may even be counterproductive. For example, financial incentives may result in increased "creaming" of participants, avoidance of innovative, but unproven strategies, or even inaccurate data reporting. When stakeholders are reluctant to adopt outcome measures, collecting performance data without financial incentives could relieve some of their concerns.
Across the states, legislatures have come down on both sides of this issue. In some states, budgeting has been linked to performance standards, so that high performing programs - and even individual offices - can receive additional money, while low performers are at risk of losing funding. In other states, there are no financial consequences attached to the performance standards, but the results are widely disseminated each year and used to provide feedback in order to improve program operations (Hatry, 1999; Horsch, 1996(a); Schilder, 1998; Yates, 1997). The data can help public managers and service providers make decisions and monitor progress toward specific goals. Coupled with program evaluation data, performance measures can potentially be used to assess service strategies, determine why results were achieved or not, and decide how programs need to be changed.
An additional factor must be considered when a new performance measurement system is adopted. As discussed above, when data for a new measure are first collected, in many cases, states will have little ability to predict their performance in advance - either because the program is new and there is no past performance, or because the data collection requirement is new and there are no baseline data. This uncertainty about performance levels appears to have very different consequences depending on whether a bonus or a penalty is involved.
In the context of penalties, performance uncertainty appears to lead to highly risk-averse behavior. For example, in defining work activities in which welfare recipients could participate, many states initially restricted the permissible activities to those that could be counted toward the federal work participation rate. Now that a few years of data are available, many states have discovered that they are in no danger of being penalized and have expanded the range of activities they allow for participants. Some states now include, for instance, educational activities not directly related to employment (including high school and equivalency programs, basic and remedial education, English as a Second Language, and post-secondary education), which counted toward the participation requirement under JOBS, among the permissible activities for TANF participants when determined appropriate.
In the context of bonuses, uncertainty appears to lead to a "wait-and-see" attitude. Without a solid idea of either how much effort is needed to achieve a certain level of performance or the potential payoff (including the size of the bonus), some states may be unwilling to invest much effort or money in order to improve their ratings. For example, in many cases, the states that received bonuses in the first year of the High Performance Bonus were those that had made investments in work and work supports even before the interim performance criteria were announced. It would not be surprising to see other states - particularly those that were close to receiving bonuses - now begin to make or expand their investments in these areas.
One possible means of mitigating the negative consequences of this asymmetry would be to implement a new measurement system in phases, beginning first with bonuses for high performers and adding penalties only after several years of experience with the measures, when more information would be available to use in setting standards. This approach was recommended by a participant in the consultation in post-consultation correspondence.
2. In the program evaluation literature, the impact of a program is defined as the difference between the average outcomes of a group that participated in the program and the average outcomes the group would have achieved had it not participated. In a formal evaluation, this difference is most reliably estimated by randomly assigning individuals either to an experimental group that participates in the program or to a control group that does not, and then comparing the two groups' outcomes. Because the experimental and control groups are randomly assigned, any differences in their outcomes can be assumed to be caused by the program being evaluated.
3. In FY 1999, 15 states or territories did not serve two-parent families under the TANF program. They either served two-parent families entirely through separate state programs, so that the TANF two-parent participation requirements did not apply, or did not serve two-parent families at all (HHS, 2000(a)).
U.S. Department of Health and Human Services
(HHS)
Updated: 02/06/01