CHERYL L. DAMBERG, MELONY SORBERO, ATEEV MEHROTRA, STEPHANIE TELEKI, SUSAN LOVEJOY, AND LILY BRADLEY
WR-474-ASPE/CMS
November 2007
Prepared for the Assistant Secretary for Planning and Evaluation, US Department of Health and Human Services
WORKING PAPER
This product is part of the RAND Health working paper series. RAND working papers are intended to share researchers’ latest findings and to solicit additional peer review. This paper has been peer reviewed but not edited. Unless otherwise indicated, working papers can be quoted and cited without permission of the author, provided the source is clearly referred to as a working paper. RAND’s publications do not necessarily reflect the opinions of its research clients and sponsors.
CONTENTS
PREFACE
TABLES
SUMMARY
ACKNOWLEDGEMENTS
ABBREVIATIONS
INTRODUCTION
Background
Development of the Value-Based Purchasing Plan
Content and Structure of This Report
A REVIEW OF THE EVIDENCE ON HOSPITAL PAY FOR PERFORMANCE
Summary of the Empirical Evidence on the Impact of Hospital Pay for Performance
Theoretical Literature and Implications for P4P Design
Limitations in Using Economic Theories to Predict Behavioral Response
Conclusions
SUMMARY OF DISCUSSIONS WITH PAY-FOR-PERFORMANCE PROGRAM SPONSORS
Methodological Approach
Findings From Discussions with Program Sponsors
Critical Lessons Learned
IV. SUMMARY OF DISCUSSIONS WITH HOSPITALS, HOSPITAL ASSOCIATIONS, AND DATA VENDORS
Methodology
V. SUMMARY OF FINDINGS FROM ENVIRONMENTAL SCAN
APPENDIX A: DESIGN ISSUES EXPLORED AS PART OF THE ENVIRONMENTAL SCAN
APPENDIX B: SUMMARY OF PAY-FOR-PERFORMANCE DESIGN PRINCIPLES
APPENDIX C: INPATIENT HOSPITAL MEASURES
APPENDIX D: LIST OF ORGANIZATIONS PARTICIPATING IN THE ENVIRONMENTAL SCAN
REFERENCES
TABLES
Table 1: Design Issues Explored with Program Sponsors and Hospitals
Table 2: Key Terms Used to Search the Literature for Hospital P4P Studies
Table 3: Summary of Design Features of P4P Programs Contained in Published Evaluation Studies
Table 4: Summary of Evaluation Studies Examining Hospital P4P Programs
Table B.1. P4P Principles and Recommendations from Stakeholders
Table B.2. Summary of P4P Design Principles and Recommendations
In recent years, pay-for-performance (P4P) programs have emerged as a strategy for driving improvements in the quality, safety, and efficiency of delivered health care. In 2005, with passage of the Deficit Reduction Act, Congress mandated that the Secretary of the Department of Health and Human Services (DHHS) develop a plan for value-based purchasing (VBP) for Medicare hospital services. VBP is one strategy for modifying the payment system to incentivize improvements in the quality of care delivered to beneficiaries in the Medicare program. The use of incentives—by paying differentially for performance—is a key component of building a value-driven health care system as called for by the DHHS Secretary’s Four Cornerstones Initiative.
To inform the development of the VBP plan for Medicare hospital services, the Assistant Secretary for Planning and Evaluation (ASPE), in collaboration with the Centers for Medicare & Medicaid Services, contracted with the RAND Corporation to conduct an environmental scan of the hospital P4P landscape. This report presents the results from the environmental scan of P4P and pay-for-reporting (P4R) programs; it also includes a review of the empirical evidence about the impact of these programs, a description of program design features, and a summary of lessons learned from currently operating P4P and P4R programs about the structure of these programs and implementation issues.
This work was sponsored by ASPE under Task Order No. HHSP233200600001T, Contract No. 100-03-0019, for which Susan Bogasky served as the Project Officer.
Mounting cost pressures and substantial deficits in the quality of care within the U.S. health care system have led policy makers to consider various reform options. Pay for performance (P4P) has emerged as a leading reform strategy, in an effort to stimulate improvements in the quality, safety, and efficiency of delivered health care (IOM, 2006). In 2005, Congress passed the Deficit Reduction Act (DRA, Public Law 109-171, Section 5001(b)), which mandated that the Secretary of the Department of Health and Human Services (DHHS) develop a plan for value-based purchasing (VBP) for Medicare hospital services that would commence in Fiscal Year (FY) 2009. VBP, which is being applied by payers in both the public and private sectors, includes the use of both financial (e.g., P4P) and non-financial (e.g., transparency of performance scores) incentives to change the behavior of providers and the systems within which they work.
Using incentives (paying differentially for performance) and measuring and making quality information transparent are key components of building a value-driven health care system, as called for by DHHS Secretary Leavitt’s Four Cornerstones Initiative. In support of this initiative, CMS has taken a number of steps toward using incentives and making quality information transparent, by funding pay-for-performance demonstrations in the hospital, physician, and home health settings, and by implementing pay for reporting (P4R) for hospitals, through the Reporting Hospital Quality Data for Annual Payment Update (RHQDAPU) program, and for physicians, through the Physician Quality Reporting Initiative (PQRI).
AN ENVIRONMENTAL SCAN OF HOSPITAL PAY FOR PERFORMANCE
The DRA required the Secretary of the DHHS to consider the following design elements when developing the VBP plan: (1) the process for developing, selecting, and modifying measures of quality and efficiency; (2) the reporting, collection, and validation of quality data; (3) the structure, size, and source of value-based payment adjustments; and (4) the disclosure of information on hospital performance. The CMS Hospital VBP Workgroup was delegated the task of developing the VBP plan for Medicare hospital services.
To inform the development of the VBP plan, the Assistant Secretary for Planning and Evaluation (ASPE) and CMS issued a contract to the RAND Corporation to conduct an environmental scan of the hospital P4P landscape. The environmental scan, conducted between August 2006 and June 2007, included:
To take advantage of the experimentation going on nationally with respect to P4P program design and implementation, discussions were held with 27 program sponsors, 28 hospitals, 7 hospital associations, 5 data support vendors, and a number of individuals with expertise in rural hospital issues. The discussions were necessary because this type of descriptive information and this level of detail about program design are not typically contained in peer-reviewed journal articles that summarize the results of P4P interventions. Additionally, many of the demonstration experiments are still in their infancy, and little has been formally documented about the related experiences. This report summarizes the findings from the environmental scan.
FINDINGS FROM THE LITERATURE REVIEW
The Empirical Literature on Hospital P4P
As of June 2007, few peer-reviewed studies existed on the use of financial incentives and their impact on quality, patient experience, safety, or the efficient use of resources. While more than 40 hospital-based P4P programs are operating in the U.S., little empirical evidence has emerged from these payment reform experiments to gauge the impact of hospital P4P in meeting programmatic goals or to understand how various design features affect such things as engagement in the program, the likelihood of creating unintended consequences (such as reductions in access to care for more difficult patients), or the distribution of payments to providers. Few P4P programs are undergoing formal evaluations to assess their impact, and evaluating real-world applications is challenging because they generally lack the comparison group required to assess the impact of the P4P intervention.
We reviewed the literature between January 1996 and June 2007 and found only nine published studies that address the impact of three separate hospital P4P programs in which formal evaluations have been occurring:
Of the eight studies examining changes in performance, each one reported improvements over time in at least some of the hospital performance measures or condition-specific composites included in the specific study; however, it is difficult to disentangle the P4P effect from the effect of other quality improvement efforts that were occurring simultaneously. The strongest evidence to date on the impact of hospital P4P comes from the Lindenauer et al. (2007) study of the impact of the Premier Hospital Quality Incentive Demonstration (PHQID) relative to the Medicare RHQDAPU program. These studies, while showing a positive effect of P4P, reveal that the additional effects of P4P are somewhat modest relative to public reporting and other quality interventions that are occurring simultaneously. Improvements in hospital performance have been observed in response to feedback reports (Williams et al., 2005) and public reporting, with a financial incentive for submitting data (Grossbart, 2006; Lindenauer et al., 2007). One study found improvements in a few performance areas associated with P4P as compared with what was seen for control hospitals participating in voluntary quality improvement activities (Glickman et al., 2007). It has been argued, however, that in order to accomplish sustained quality improvement, interventions should be multifaceted and focus on different levels of the health care system (Grol et al., 2002; Grol and Grimshaw, 2003). This suggests that to be most effective, P4P should be partnered with other activities, such as public reporting and internal quality improvement activities, that also encourage quality improvement for the same clinical area.
There is less evidence of the effect of P4P on patient outcomes. One study (Berthiaume et al., 2006) found reduced complication rates for obstetrical and surgical patients in an uncontrolled study, though it was not reported whether those improvements were statistically significant. Glickman et al. (2007) did not find significant differences in inpatient mortality improvement for AMI between PHQID and control hospitals exposed to an AMI quality improvement intervention. None of the studies evaluating PHQID separately analyzed the other patient outcome measures (for coronary bypass surgery and hip and knee replacement surgery) included in the program, so it is not clear whether improvements occurred in these measures.
Most of the published studies have significant methodological limitations. Six of the nine had no controls, which are critical for providing evidence of a link between P4P and performance improvements. This is particularly important given the documented temporal trend toward increasing performance on many hospital quality metrics. Another important issue to consider is whether the experience of these smaller-scale incentive programs, with the exception of the PHQID, can be generalized to predict the effects of wholesale national implementation of a hospital P4P program by Medicare.
Theoretical Literature and Implications for P4P Design
P4P is common in industries other than health care, and economists and management experts have studied and developed theories on how individuals respond to financial incentives. The economic and management theories that we reviewed suggest that the way in which P4P incentives are structured, or framed, could influence whether they achieve the desired behavioral response. Among the key highlights of this literature review:
FINDINGS FROM THE KEY INFORMANT DISCUSSIONS
Design Lessons
Discussions with program sponsors, hospitals, and data vendors revealed the following lessons about P4P program design and operation:
Payment structures: Existing P4P programs primarily make reward payments on the basis of improvement over time or relative performance. Hospitals universally agreed that payment structures should use absolute thresholds and reward all good performers, rather than providing incentives on a relative-performance basis (such as paying only the top 10 or 20 percent of hospitals participating in a P4P program). This was seen as critical when the measures of performance used have scores that “top out,” reflecting little meaningful difference in performance across most hospitals. Program sponsors felt strongly that performance improvement as well as attainment of specific benchmarks should be included as a component of the payment structure, at least in the early years of a P4P program, in order to engage all hospitals. Hospitals also noted the difficulty of getting physicians to change their behavior absent aligned incentives on the physician side, and called for program sponsors to create parallel physician incentives focused on inpatient care for the same conditions used in hospital programs.
Absence of Knowing What Works: Because P4P is a newly emerging reform tool and little information is currently available about the impact of P4P or the influence of various design structures on P4P outcomes, P4P programs should incorporate evaluation and ongoing monitoring into their design as a means of building a knowledge base. Hospitals and P4P program sponsors recommended allowing experimentation, which would create models where learning could occur to inform future design structures. The discussants noted that the results of P4P may differ as a function of the program design features as well as the varying structure of local health care markets, and that much could be gained from examining the experience of these local experiments. Collecting and broadly disseminating this type of information will be critical to future efforts to construct P4P programs so that they can meet their programmatic objectives. Funding will be necessary to support program evaluation, and the evaluation work needs to be sustained over multiple years to fully assess impact and monitor for unintended consequences.
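As a rough illustration of the payment structures discussed in the design lessons above, the following Python sketch contrasts three ways of deciding which hospitals earn a reward: an absolute threshold, a relative ranking, and improvement over time. The hospital labels, the scores, the 95-point threshold, the top-20-percent cutoff, and the 5-point improvement requirement are all hypothetical values chosen for illustration; none is drawn from a program reviewed in the environmental scan.

```python
# Hypothetical composite scores (0-100) for ten hospitals in two periods.
# All labels and values are illustrative assumptions, not program data.
prior = {"A": 62.0, "B": 75.0, "C": 90.0, "D": 93.0, "E": 94.0,
         "F": 95.0, "G": 96.0, "H": 96.0, "I": 97.0, "J": 98.0}
current = {"A": 74.0, "B": 82.0, "C": 95.0, "D": 96.0, "E": 96.5,
           "F": 97.0, "G": 97.5, "H": 98.0, "I": 98.5, "J": 99.0}


def reward_absolute(scores, threshold):
    """Reward every hospital scoring at or above a fixed threshold."""
    return {h for h, s in scores.items() if s >= threshold}


def reward_relative(scores, top_percent):
    """Reward only the top `top_percent` of hospitals, ranked by score."""
    n_winners = max(1, len(scores) * top_percent // 100)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_winners])


def reward_improvement(before, after, min_gain):
    """Reward hospitals whose score rose by at least `min_gain` points."""
    return {h for h in after if after[h] - before[h] >= min_gain}


absolute = reward_absolute(current, threshold=95)
relative = reward_relative(current, top_percent=20)
improved = reward_improvement(prior, current, min_gain=5)

print("Absolute threshold:", sorted(absolute))
print("Top 20 percent:    ", sorted(relative))
print("Improvement:       ", sorted(improved))
```

In this made-up data set, eight hospitals clear the absolute threshold but only two earn the relative reward, even though the scores of those eight differ by at most four points. The improvement rule instead reaches the three hospitals that started lowest. This is the pattern hospitals described when they objected to relative rewards for measures whose scores "top out," and it suggests why program sponsors wanted an improvement component to keep lower-scoring hospitals engaged.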
Program Implementation Challenges
The environmental scan also uncovered a number of program implementation challenges that warrant consideration during program design and implementation.
The Small Numbers Problem: A sizeable number of hospitals have only a small number of events or cases to report for one or more measures. A small number of events to score will result in unstable estimates of performance as a basis for determining performance-based incentive payments. While this is a more acute problem for small and rural hospitals with a small number of patients per year, the problem also occurs in some medium- and large-size hospitals depending on their service mix, the details of measure specifications, and the use of sampling during data collection. Using all-payer data, collecting and aggregating data over longer periods of time, using composite measures, and identifying measures relevant to smaller providers are approaches that can help to mitigate the small numbers problem and allow for the construction of more stable estimates of performance.
The Burden of Data Collection: The data collection burden, which affects how many measures a P4P program can reasonably require a hospital to collect and report, creates challenges for efforts to comprehensively assess the performance of hospitals given the wide range of care and services provided within hospitals. The more comprehensive the measure set used, the greater the burden on hospitals in the near term, given that most of the data needed to construct performance measures is contained in paper medical records. In most cases, hospital information systems are not yet equipped to capture and easily retrieve the clinical information used to create performance measures, nor are they structured to enable routine monitoring of quality of care. Until health information systems are upgraded to capture this information, program sponsors may be constrained in the number and breadth of measures they can expect hospitals to collect and report. Once effective information systems are built and put into place, the number of measures included in a P4P program could be expanded.
Ensuring the Validity of Data Used to Make Differential Payments: P4P programs also face an acute need to ensure the integrity of the data used to score hospitals and make differential payments, which requires resources for data validation. Allocating sufficient resources to validation work is critical for program credibility, and today only limited resources are being used for data validation within P4P programs. Most hospitals stated that the current level of validation is insufficient, and the incentives to game the system will increase as the amount of money at risk in P4P programs increases.
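The small numbers problem described among the implementation challenges above can be made concrete with a short calculation. The Python sketch below computes an approximate 95 percent confidence interval for a hospital's observed performance rate at different case counts. The Wilson score interval is used here as one common choice for proportions; it is an assumption of this sketch, not a method prescribed by any program in the scan, and the case counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical hospitals with the same observed rate (80 percent) but
# different numbers of eligible cases; the counts are illustrative only.
examples = {
    "small hospital, one quarter (8 of 10)": (8, 10),
    "small hospital, four quarters pooled (32 of 40)": (32, 40),
    "large hospital, one quarter (160 of 200)": (160, 200),
}

for label, (successes, n) in examples.items():
    low, high = wilson_interval(successes, n)
    print(f"{label}: {low:.2f} to {high:.2f} (width {high - low:.2f})")
```

Although all three hypothetical hospitals show the same 80 percent rate, the interval based on 10 cases is considerably wider than the one based on 200 cases, so a payment decision resting on the smaller sample is far more sensitive to chance. Pooling four quarters of data narrows the interval, which is consistent with the suggestion above to aggregate data over longer periods.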
In summary, P4P programs have the potential to drive system improvements but their impact is likely influenced not only by their design but also by what other structures are in place to support P4P—such as enhanced information systems for quality monitoring and feedback, aligned payments across all providers, and transparency. The success of these programs in meeting improvement goals likely will be affected by their design, how they are implemented, and whether sufficient resources are allocated to provide the necessary day-to-day support for program operations and ongoing modification of the program.
Hospitals understand that P4P is likely to be part of their future and generally seem supportive of the concept. They face a number of challenges to their ability to successfully participate in these programs, including lack of physician engagement, inadequate information infrastructure that necessitates the manual collection of data from charts, and potentially conflicting signals from various organizations measuring hospital performance. These implementation challenges are important to consider carefully in the design of any hospital P4P program.
We gratefully acknowledge the sponsors of the pay-for-performance programs and the hospitals, hospital associations, and data vendors whose people willingly made the time to participate in individual discussions with us. They offered us valuable information and insights about their experiences in designing and implementing pay-for-reporting and pay-for-performance programs.
We also extend our appreciation to the members of our Technical Expert Panel—Dr. Elliott Fisher of Dartmouth University, Dr. Jack Wheeler of the University of Michigan School of Public Health, Dr. Dale W. Bratzler of the Oklahoma Foundation for Medical Quality, and Dr. Howard Beckman of the Rochester Individual Practice Association—for their thoughtful review of the discussion guides to help ensure that pertinent topics and issues were addressed and their review of this report. In addition, we appreciate the assistance provided by Geoff Baker of Med-Vantage in helping us construct and narrow the list of candidate hospital pay-for-performance programs with which we held discussions. Finally, we thank Susan Bogasky, from the Assistant Secretary for Planning and Evaluation, who served as Project Officer for this contract. We also appreciate the guidance and feedback provided by Dr. Julie Howell, Project Coordinator Hospital VBP, CMS Special Program Office for Value-Based Purchasing, and Dr. Thomas Valuck, Director, CMS Special Program Office for Value-Based Purchasing.
The Cost and Quality Problems
Substantial, well-documented deficiencies exist in the quality of care that is provided to patients in the United States (Institute of Medicine [IOM], 2001; Schuster, McGlynn, and Brook, 1998; Wenger et al., 2003). In a landmark study published in 2003, McGlynn et al. (2003) found that adult patients received only about 55 percent of recommended care and that adherence to clinically recommended care varied widely by medical condition. The follow-on analysis, conducted by Asch et al. (2006), found that the quality deficit was persistent across all sociodemographic subgroups and that although quality of care varied moderately across the sociodemographic subgroups, there was substantial underuse of recommended care regardless of income, race, or age. Other studies, such as those by Fisher et al. (2003a and b), have shown that among Medicare beneficiaries, there is substantial regional variation in the use of services and health spending. Also, regions where more services were provided did not show additional benefit to patients either through improved outcomes or improved satisfaction with care. These studies highlight that problems occur in both the underuse of recommended care services and the overuse of services.
Health care costs continue to rise at a steady pace and are anticipated to account for 18.7 percent of gross domestic product by 2014 (Heffler et al., 2005). In 2006, the federal government spent $600 billion for Medicare and Medicaid for care delivered to its approximately 87 million beneficiaries; and it is anticipated that by 2030, expenditures for these two programs will consume 50 percent of the federal budget, a financial burden that will place funding for other discretionary programs at risk (McClellan, 2006). To improve quality and hold down growth in the costs of the Medicare and Medicaid programs, the Centers for Medicare & Medicaid Services (CMS) will need to explore alternatives to existing policies and practices.
The Disconnect Between Payments and Performance
Existing mechanisms for paying hospitals, both Medicare’s per-hospitalization payments using diagnosis-related groups (DRGs) and the per diem payments used by commercial payers, do not differentiate payments to hospitals providing efficient, high-quality care. Current payment policies in both the public and the private sector reward the quantity rather than the quality of care delivered and provide neither incentive nor support for improving quality of care. Historically, hospitals have been paid the same regardless of the quality of care they provided and, in some cases, may have even received additional payment for treatment of avoidable complications and for readmissions and complications that occurred as a result of providing poor-quality care. CMS has announced that, starting in 2008, it will no longer pay Prospective Payment System (PPS) hospitals for the additional costs of certain preventable conditions acquired in the hospital (CMS, 2007a).
Calls for System Reform
The 2001 IOM report Crossing the Quality Chasm called upon policymakers in the public and private sectors to make reforms that would address problems of quality and inefficiencies. A key reform recommended by the IOM was to create financial incentives for quality and to make performance information transparent to ensure public accountability. More recently, the IOM made specific recommendations for implementing payment rewards for performance within Medicare in its 2006 report titled Rewarding Provider Performance: Aligning Incentives in Medicare. Additionally, the Medicare Payment Advisory Commission (MedPAC), which advises the U.S. Congress on issues related to the Medicare program, has recommended that Medicare adopt pay for performance (P4P) across various settings, including Medicare Advantage plans, dialysis providers, hospitals, home health agencies, and physicians (MedPAC, 2005).
Federal Actions to Reform the System
On August 22, 2006, President Bush issued an Executive Order, “Promoting Quality and Efficient Health Care,” that requires the federal government to: (1) ensure that federal health care programs promote quality and efficient delivery of health care and (2) make readily useable information available to beneficiaries, enrollees, and providers. These actions are designed to drive improvements in the value of federal health care programs.
To support this mandate, Department of Health and Human Services (DHHS) Secretary Michael Leavitt embraced “four cornerstones” for building a value-driven health care system:
Building on these four cornerstones, CMS has taken steps toward using incentives and making quality information transparent in order to become a value-based purchaser of care. The steps taken include funding a number of demonstrations regarding use of financial incentives across hospital, physician, and home health settings, and implementing pay for reporting (P4R) for hospitals and physicians through the Reporting Hospital Quality Data for Annual Payment Update (RHQDAPU) program and the Physician Quality Reporting Initiative (PQRI). In particular, the RHQDAPU program, which was mandated under the Medicare Prescription Drug, Improvement, and Modernization Act of 2003 (MMA), required hospitals to submit data on a defined set of performance measures to receive 0.4 percentage points of their annual payment update (APU). The performance data from RHQDAPU are made transparent to Medicare beneficiaries and the public through the CMS Hospital Compare website (http://www.hospitalcompare.hhs.gov). Section 5001(a) of the 2005 Deficit Reduction Act (DRA) expanded the set of RHQDAPU P4R performance measures and increased the differential payment for reporting from 0.4 to 2 percentage points.
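To give a sense of scale for the reporting differential described above, the Python sketch below works through the arithmetic for a single hospital. It reads the differential as a reduction in the annual payment update for a hospital that does not report, and it treats the update as scaling total payments directly; both are simplifying assumptions of this sketch. The base payment amount and the full update percentage are hypothetical and do not come from this report.

```python
def updated_payment(base_payment, full_update_pct, reduction_pct=0.0):
    """Apply an annual payment update, less any reduction for not reporting.

    Percentages are in percentage points (for example, 3.4 means 3.4 percent).
    """
    return base_payment * (1 + (full_update_pct - reduction_pct) / 100)


# Hypothetical inputs: neither figure comes from the report.
base_payment = 50_000_000   # annual Medicare inpatient payments, in dollars
full_update_pct = 3.4       # assumed full annual payment update

reporting = updated_payment(base_payment, full_update_pct)
for label, reduction in [("MMA (0.4 points)", 0.4), ("DRA (2 points)", 2.0)]:
    non_reporting = updated_payment(base_payment, full_update_pct, reduction)
    print(f"{label}: non-reporting hospital forgoes "
          f"${reporting - non_reporting:,.0f}")
```

Under these assumptions, the amount at stake is simply the base payment multiplied by the differential, so moving from 0.4 to 2 percentage points raises the incentive to report fivefold regardless of the size of the full update.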
The 2005 DRA also authorized the DHHS Secretary, under Section 5001(b), to develop a plan for value-based purchasing (VBP) for Medicare hospital services commencing fiscal year (FY) 2009. Congress specified that the VBP plan consider the following design issues:
Through implementation of VBP for Medicare hospital services, CMS would provide differential payments to hospitals based on their performance (i.e., P4P).
In response to the DRA mandate, CMS created an internal hospital VBP workgroup with responsibility for developing the VBP plan. To inform the development of the plan, the Assistant Secretary for Planning and Evaluation (ASPE), in collaboration with CMS, contracted with the RAND Corporation in July 2006 to conduct a literature review to synthesize the empirical evidence that exists on P4P in the hospital setting and an environmental scan of the existing P4P landscape.
To take advantage of the experimentation going on nationally with respect to P4P program design and implementation, RAND held discussions with P4P program sponsors, hospitals, hospital associations, data support vendors, and organizations experienced with small and rural hospitals to capture the array of experiences connected with the design and implementation of P4P and P4R programs. The discussions were necessary because this type of descriptive information and this level of detail about program design are not typically contained in peer-reviewed journal articles that summarize the results of P4P interventions. Additionally, many of the demonstration experiments are still in their infancy, and little has been formally documented about the related experiences.
RAND was tasked to:
Table 1 highlights core design issues that were examined as part of the environmental scan. Appendix A contains a complete listing of the design issues that were explored.
Table 1: Design Issues Explored with Program Sponsors and Hospitals

| Issue Type | Issue |
|---|---|
| Overview | The goals of existing P4P programs and demonstrations in the hospital setting |
| | Whether and how hospitals were included in the design and implementation of P4P and P4R programs |
| | The mechanisms used to monitor for unintended consequences, such as inappropriate clinical care or gaming of data to secure bonus dollars |
| | Lessons learned by organizations with P4P and P4R programs in practice or participating in demonstrations |
| Measures | The measures of performance (clinical effectiveness, efficiency, patient experience, care coordination/transitions, etc.) that are currently being used for both inpatient and outpatient hospital care in practice and in demonstrations |
| | The measures selection criteria being used by P4P and P4R programs |
| | Methodological issues around P4P, including the level of aggregation of measures (i.e., composite scoring, weighting); the establishment of benchmarks, thresholds, and targets; risk adjustment; and opportunities for gaming |
| Data | The data collection, data management, reporting infrastructure, and data outreach required to implement existing P4P programs |
| | Methods being used to validate data for use in P4P programs |
| Payment Mechanism | The types of incentives, financial or non-financial, that currently exist or are under consideration, and what has been learned from various incentive structure designs |
| | Examining the basis for payment, such as paying on meeting a threshold, improvement, and/or high achievement |
| | The levels (fixed dollar, percentage of payments) and types (negative versus positive) of financial incentives being used |
| Public Reporting | How information from public reporting systems is being used, and the impact of this information |
| | Strategies for simplifying public reports to facilitate use and understanding |
| Outpatient | Whether outpatient hospital services should be incorporated into VBP in the future |
| | Extent to which current P4P programs include measures of hospital outpatient services |
This chapter builds the foundation for subsequent chapters of this report by defining P4P and its dimensions and by providing the policy context underlying the rationale for P4P as a system reform strategy.
Defining Value-Based Purchasing
VBP is a strategy that strengthens the link between quality and provider payments by rewarding providers that deliver high-quality, cost-efficient care. VBP encompasses a number of activities that can be used individually or as a mutually supportive set to engender provider behavior change. One activity that falls under the VBP umbrella and has garnered much attention and interest in recent years is P4P. P4P explicitly links health care providers’ pay to their performance on a set of specified measures such that better-performing providers receive higher payments than do lower-performing providers. The term provider, which we use throughout this report, encompasses a broad spectrum of health care providers: hospitals, individual physicians, physician practices, medical groups, and integrated delivery systems.
P4P programs seek to align measurement of and payments to providers with a program sponsor’s goals, such as the delivery of high-quality, cost-efficient, patient-centered care. For example, if a program sponsor is seeking to improve patient outcomes, the program will include either measures of risk-adjusted mortality or complication rates or clinical measures, such as the provision of disease-specific services. If that program sponsor also seeks to improve the cost efficiency of care, the program may also include readmission rates or risk-adjusted length of stay. P4P programs are designed to financially reward those providers whose performance is consistent with the program sponsor’s identified goals.
Three other mechanisms that use financial and non-financial incentives also seek to incentivize changes in provider and/or consumer behavior as means to improve quality and efficiency in health care delivery. These three mechanisms were excluded from our environmental scan of P4P in the hospital setting per se, although public reporting is often a component of P4P programs and is a core quality improvement strategy that CMS is currently implementing through the RHQDAPU program. The mechanisms are as follows:
Principles for Pay-for-Performance Programs
Numerous organizations have developed design principles for P4P programs in the hopes of influencing how CMS and other P4P sponsors structure their P4P programs (see Appendix B). Among these organizations are MedPAC, the Joint Commission, employer coalitions, the American Medical Association (AMA) and other physician groups, the American Hospital Association (AHA), and the Association of American Medical Colleges (AAMC).
The principles cover a wide variety of program design and implementation issues, and at times the recommendations made by the different organizations directly oppose one another. Five major areas of disagreement about P4P design and implementation issues are:
There was also variation in the topics explicitly included by organizations in their statements. For example, physician organizations frequently include these principles: voluntary participation, no link between rewards and the ranking of physicians relative to one another, reimbursement of physicians for the administrative burden of collecting and reporting data, and physician involvement in program design.
There are, however, areas of consensus. Nine or more organizations endorsed the following principles/recommendations:
The remainder of this report presents the findings of RAND’s environmental scan of hospital P4P. Chapter 2 reviews the empirical literature on the impact of hospital P4P. It also draws from the economics and organizational management theoretical literature that has examined the effect of incentives on behavior to assess possible implications for P4P program design. Chapter 3 summarizes our discussions with hospital P4P program sponsors nationally, focusing on a description of the measures being used by these programs, the structure of the incentive payments, operational issues associated with implementation, and lessons learned. Chapter 4 summarizes our discussions with hospitals that have been exposed to P4P and P4R efforts (such as the CMS RHQDAPU program, the Premier P4P demonstration, or private-sector P4P programs), hospital associations, and data vendors that support hospitals in their data submissions to the array of performance-reporting efforts. Our emphasis in these discussions was on learning what hospitals thought about the set of performance measures for which they were being held accountable, the structure of the incentive payments, issues related to data submissions and the quality and validity of data used to score their performance, the importance of public reporting, barriers they saw as hampering their ability to comply with the program requirements, and lessons they had learned. As part of these discussions, we also focused on understanding the unique issues of small hospitals, rural hospitals, and Critical Access Hospitals (CAHs) that would affect their ability to participate in P4P programs. Chapter 5 concludes by summarizing the key findings from the environmental scan.
This chapter summarizes the empirical evidence on the effect of P4P in the hospital setting, based on application and theory. We begin with a review of published studies that assess the impact of P4P programs on health care quality, safety, and/or resource use, including studies that address P4P in either the hospital inpatient or the hospital outpatient setting. We then follow with a summary of relevant lessons for hospital P4P that can be drawn from the management and economic literature on how individuals in general respond to incentives, and we consider the implications for structuring incentives to achieve the desired behavioral response.
Methods
Our review of the empirical literature on the effects of P4P included all peer-reviewed published studies describing the impact of a hospital P4P program for either inpatient or outpatient hospital services. We defined outpatient hospital services as any medical or surgical services performed primarily in an outpatient/ambulatory care setting that are billed through a hospital. Examples of outpatient hospital services include chemotherapy, outpatient surgery, and diagnostic tests such as colonoscopy. The review included any randomized controlled studies, quasi-experimental trials, and pre-/post-intervention studies. We only retained articles that reported empirical findings related to the effect of paying for quality, patient experience, and safety or resource use, specifically excluding articles focused only on the impact of changes in hospital payment, such as the shift to the Prospective Payment System (PPS) and P4P as applied to physicians in the ambulatory setting. Only studies that were in English and published in the last 10.5 years were included.
We searched for articles published between January 1996 and June 2007 using five bibliographic databases (PubMed, EconLit, CINAHL, PsycINFO, and ABI/INFORM) that could include articles related to P4P and financial incentives specific to the hospital environment. Table 2 displays the search strategy and terms used to identify relevant articles for hospital inpatient and hospital outpatient settings separately.
Table 2: Key Terms Used to Search the Literature for Hospital P4P Studies
| Hospital Inpatient | Hospital Outpatient |
|---|---|
| pay for performance OR p4p OR “pay for quality” OR “pay for value” OR “value based purchasing” OR “financial incentives” OR “monetary incentives” | “pay for performance” OR p4p OR “pay for quality” OR “pay for value” OR “value based purchasing” OR “financial incentives” OR “monetary incentives” |
| (bonus* OR reward* OR (incentive reimbursement)) AND quality | This resulted in a database of 1,575 articles. Within this database, we retained any article that included the following keywords: |
| hospital OR hospitals | |
| (Results from search #1 or #2) AND (Results from Search #3) | |
| NOT (organ donation) | |
We combined the results of this search strategy for each setting (conducted initially in November 2006 and updated with articles published through June 2007) from the five different databases and then eliminated duplicate articles. Titles and abstracts for these articles were reviewed, and potentially eligible articles were identified. The full text of the set of potentially eligible articles was then read to determine whether the article was appropriate for inclusion. Reference lists of the included articles were checked to identify additional relevant studies. To ensure our scan was comprehensive, we also consulted experts in the field of P4P and retrieved references from recent reports on P4P and payment reform from the IOM, the Joint Commission, MedPAC, and the Agency for Healthcare Research and Quality (AHRQ).
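The merge-and-deduplicate step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the title-normalization rule, and the sample titles are hypothetical and are not the procedure or data actually used in the review.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation/whitespace so near-identical copies match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_and_deduplicate(results_by_database):
    """Pool records from several databases, keeping the first copy of each distinct title."""
    seen = set()
    unique = []
    for database, records in results_by_database.items():
        for record in records:
            key = normalize_title(record["title"])
            if key not in seen:
                seen.add(key)
                unique.append({**record, "source": database})
    return unique


# Hypothetical records, for illustration only (not articles from the review).
sample = {
    "PubMed": [{"title": "Hospital pay for performance: early results"}],
    "EconLit": [
        {"title": "Hospital Pay-for-Performance: Early Results."},
        {"title": "Financial incentives and hospital quality"},
    ],
}

unique_records = merge_and_deduplicate(sample)
print(len(unique_records), "unique records")
```

Matching on a normalized title is a deliberately simple choice; a real screening workflow would more likely also compare authors, year, and identifiers before discarding a record.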
From the initial search strategy, we identified 902 non-duplicated articles for the hospital inpatient setting and 162 non-duplicated articles for the hospital outpatient setting. After the abstracts were reviewed, eleven articles were targeted for further review for the inpatient setting and zero for the hospital outpatient setting. Of the eleven articles, eight met our criteria for inclusion. After consultation with P4P experts and a review of relevant reports, one more paper was thought to be sufficiently important to include. It is a white paper, not published in the peer-reviewed literature, describing the early results of the CMS–Premier Hospital Quality Incentive Demonstration (PHQID). Our summary therefore focuses on the findings from nine articles that describe P4P intervention in the inpatient setting.
The methodological quality of the articles was assessed by evaluating the overall study design in terms of its strength in determining a causal relationship or an association between the intervention and the outcome. For example, we determined whether the study design was a pre-post measurement without a control group, a pre-post study with a control group (a quasi-experimental study design), or a randomized controlled trial. If there was a control group, we also assessed its adequacy, such as whether hospitals in the control group were reasonably similar to hospitals exposed to the P4P intervention. If there was no control group, we assessed whether the study controlled for pre-intervention trends in performance. Lastly, we assessed the studies’ use of appropriate statistical methods for estimating an intervention effect. These characteristics were used to determine the quality of the studies being reviewed, with randomized controlled trials providing the strongest evidence of a causal relationship between the implemented program and changes in performance measures, and uncontrolled studies providing weaker evidence.
Findings from the Literature Review
As of June 2007, few peer-reviewed studies existed on the use of financial incentives to affect quality, patient experience, safety, or the efficient use of resources. While more than 40 hospital-based P4P programs are operating in the U.S., few of them are undergoing formal evaluations to assess their impact.
The nine articles in our review address the impact of three separate hospital P4P programs in which formal evaluations have been occurring:
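Table 3: Summary of Design Features of P4P Programs Contained in Published Evaluation Studies
| Hospital P4P Program | Type of Measures: Outcome | Type of Measures: Process | Type of Measures: Structure | Type of Measures: Patient Experience | Type of Measures: Patient Safety | Type of Performance Target: Absolute | Type of Performance Target: Relative | Form of Financial Incentive: Bonus | Form of Financial Incentive: Withhold | Form of Financial Incentive: Penalty |
|---|---|---|---|---|---|---|---|---|---|---|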
| HMSA | X | X | X | X | X | X | X | |||
| BCBS of Michigan | X | X | X | X | X | X | ||||
| PHQID | X | X | X | X | X | X | ||||
Table 3 presents a high-level summary of key design features of each of these three P4P programs. Table 4 provides descriptive data on the evaluation studies. More detailed findings from our evaluation are in the following subsections.
Table 4: Summary of Evaluation Studies Examining Hospital P4P Programs
| P4P Program | Article | Type of Study | Change in Performance | Control Group |
|---|---|---|---|---|
| HMSA P4P Program | Berthiaume et al., 2004 | Describes uptake of one component of program and how many dollars were dispensed | No | No |
| | Berthiaume et al., 2006 | Describes trends in measures | Yes | No |
| BCBS of Michigan Hospital Incentive Program | Nahra et al., 2006 | Cost-effectiveness analysis | Yes | No |
| | Sautter et al., 2007 | Qualitative interviews with leadership of 10 participating hospitals | NA* | No |
| | Reiter, Nahra, and Wheeler, 2006 | Survey of participating hospitals to track behavioral responses | No | No |
| PHQID | Premier White Paper | Describes improvements in quality measures | Yes | No |
| | Grossbart, 2006 | Evaluates improvements in quality versus a “matched” control group | Yes | Yes |
| | Lindenauer et al., 2007 | Evaluates improvements in quality versus a “matched” control group | Yes | Yes |
| | Glickman et al., 2007 | Evaluates improvements in quality versus a control group | Yes | Yes |
* Note to Table 4: Change in performance was used to select hospitals for the interviews and was not the outcome examined by the research.
Hawaii Medical Service Association Pay-for-Performance Program
Two papers evaluated the impact of the HMSA P4P program, which started in 2001 and targeted all 17 hospitals in Hawaii. The program had four components:
The complication and length-of-stay measures focused on patients admitted to the obstetric service or undergoing one of the 18 most common surgical procedures, which accounted for approximately 50 percent of the surgical case volume. The HMSA hospital P4P program has been evaluated, and the results of the evaluation are contained in two articles by Berthiaume and colleagues (2004 and 2006).
Berthiaume et al., 2004: This study looks at the rates of participation in the “Get with the Guidelines—Coronary Artery Disease” component of the HMSA P4P program. The authors report that of the 13 hospitals in Hawaii with more than 30 admissions for acute coronary artery disease, 10 earned some points associated with participation in “Get with the Guidelines.” The average incentive amount to the 10 hospitals ranged from $5,514 to $114,574 in one year. The authors state that the fact that 85 percent (11/13) of the eligible hospitals participated in “Get with the Guidelines” is noteworthy because this level of program adoption “is much higher than would be predicted by models of diffusion of innovation in healthcare.” The authors report that the incentive dollars helped provide support within hospitals for salaries and travel costs and led to substantial changes to the systems of care.
This study suffers from several limitations that restrict our ability to assess the impact of the P4P program. It reports only how many hospitals participated in the program at a single point in time, 2003—not whether participation, number of points earned, or scores on the myocardial infarction process measures increased over the intervention period. Since there was no control group, it is unclear whether participation in the “Get with the Guidelines” care improvement effort was truly driven by the incentive program versus other factors. Hospitals around the country were being encouraged to enroll in the program, and many of the measures that the program used were also being used by the Joint Commission and CMS as part of their quality measurement and improvement efforts. This study does not provide evidence on the impact of the incentive program in changing clinical process or outcome measures and how the results might generalize more broadly.
Berthiaume et al., 2006: This second study by Berthiaume and colleagues reports changes in the following HMSA P4P program areas: length of stay, complication rates, patient satisfaction, and the hospital’s internal quality initiatives. It does not report changes in the clinical process of care measures for AMI. The study design used pre-post measurement with 2001 as the baseline year and 2004 as the final year of available data. The HMSA program awarded $9 million in financial incentives across all parts of the program in 2004.
The authors report that complication rates for both obstetric and surgical patients declined approximately 2 percentage points between 2001 and 2004. Average length of stay also decreased for both types of patients; surgical patients experienced a decrease in length of stay of approximately 1.2 days, whereas length of stay for obstetric patients decreased by approximately 0.4 days. Patient satisfaction with inpatient care remained stable (78 percent in 2001 versus 79 percent in 2004); satisfaction with emergency room care increased from 71 percent in 2002 to 75 percent in 2004. Lastly, although the scoring mechanism for internal quality initiatives was changed halfway through the program, the scores increased between 2003 and 2004 from 4.25 to 6.5 points out of a total of 10 possible points. The authors do not state whether the observed differences between time periods were statistically significant. However, confidence intervals shown in figures contained in the article appear to indicate that only the change in surgical length of stay was statistically significant.
The authors state that it is unclear whether these upward shifts in performance were caused by the HMSA P4P program intervention or other factors occurring more broadly, such as greater national emphasis on improvements in AMI care or efforts to reduce utilization. As is typical for P4P programs being implemented nationally, the HMSA program did not have a control group to determine the effect of the HMSA intervention separate from other factors that may have caused the observed changes.
Blue Cross and Blue Shield of Michigan Hospital Incentive Program
Three published papers have examined the impact of the BCBS of Michigan Hospital Incentive Program. This program was initiated in 2000 and fully implemented in 2001 between BCBS of Michigan and the 86 hospitals statewide with which it contracts. Under the incentive program:
As of this review, no results have been published describing changes in quality metrics in response to this program. The three evaluation studies that have been published examine the cost-effectiveness of the program (Nahra et al., 2006), results of qualitative interviews with leadership at 10 participating hospitals (Sautter et al., 2007) and the results of a survey of organizational changes that participating hospitals reported making in response to the P4P program (Reiter, Nahra, and Wheeler, 2006).
Nahra et al., 2006: This study estimated the cost-effectiveness of the Michigan BCBS Hospital Incentive Program from the perspective of the program’s sponsor, the health plan. In estimating the costs, the researchers included incentive amounts paid to hospitals by BCBS and the costs of administering the program. Benefits from the program were estimated by using increases in performance on the process measures to calculate the number of patients receiving improved heart care. These calculations were combined with published clinical trials data to estimate how many quality-adjusted life years (QALYs) would be saved from the improved heart care over the 2000–2003 period. The researchers estimated that the clinical quality improvements observed would lead to savings of 733 to 1,701 QALYs. Based on this calculation and the cost of the program to the health plan, the cost per QALY was between $12,967 and $30,081, a range generally considered to be cost-effective (Ubel et al., 2003). This study illustrates that modest quality improvements can lead to substantial gains in QALYs saved. Additional unpublished information obtained from the program evaluator (J. Wheeler, private communication) indicated that hospitals’ reported incremental costs for participation in the P4P program were on average $36,915 for large teaching hospitals and $28,525 for other hospitals. Even taking these costs into account, the program would be considered cost-effective.
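The cost-per-QALY range above follows from dividing a program cost by the low and high QALY estimates. The sketch below illustrates that arithmetic. The total program cost is not stated in this summary, so the figure used here (about $22.05 million) is an assumption back-calculated from the reported ratios, not a number reported by Nahra et al.; the function and variable names are likewise illustrative.

```python
def cost_per_qaly(total_cost: float, qalys: float) -> float:
    """Cost-effectiveness ratio: program cost divided by QALYs saved."""
    if qalys <= 0:
        raise ValueError("qalys must be positive")
    return total_cost / qalys


# QALY range and cost-per-QALY range as summarized in the text above.
QALYS_LOW, QALYS_HIGH = 733, 1701
REPORTED_RATIO_LOW, REPORTED_RATIO_HIGH = 12_967, 30_081

# ASSUMPTION: hypothetical total cost to the health plan, inferred from
# reported ratio x QALYs (both bounds imply roughly $22.05 million).
ASSUMED_PROGRAM_COST = 22_050_000

# More QALYs for the same cost gives the lower (more favorable) ratio.
best_case = cost_per_qaly(ASSUMED_PROGRAM_COST, QALYS_HIGH)
worst_case = cost_per_qaly(ASSUMED_PROGRAM_COST, QALYS_LOW)

print(f"Cost per QALY: ${best_case:,.0f} to ${worst_case:,.0f}")
```

Under this assumed cost, the computed range lands within a few dollars of the reported $12,967 to $30,081, which suggests the published range reflects a single cost estimate applied to the two QALY bounds.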
One limitation of this evaluation is the absence of a control group or trend data from the period prior to intervention to know whether the observed improvements in heart care are attributable to the BCBS Hospital Incentive Program or other secular trends in care for heart disease (such as the CMS RHQDAPU pay-for-reporting program, the Joint Commission quality improvement initiatives, or the CMS 7th Scope of Work quality improvement efforts).
Reiter, Nahra, and Wheeler, 2006: This study reports the results of a survey of the 86 hospitals participating in the BCBS of Michigan Hospital Incentive Program. The survey measured the effect of participating in the program on hospital behavior. The study outcomes were the number of hospitals self-reporting that the incentive program had triggered a structural change or a process change within the hospital. Structural changes included the formalization of a quality management staff position or a change in the person responsible for quality. Process changes included implementation of a computerized physician order entry (CPOE) system or creation of case-management teams. Of the 86 hospitals participating in the program, 66 responded to the survey (77 percent response rate). Of the respondents, 32 (48 percent) reported that they had made a structural change and 39 (59 percent) reported they had made a process change in response to the P4P program. Overall, 75 percent of the responding hospitals reported making at least one type of change as a result of the BCBS Hospital Incentive Program. The most common structural change was involvement of leadership and greater board engagement in quality improvement. The most common process changes were instituting physician education, developing case-management teams, and increasing leverage with hospital physicians. The authors observed that since most of the process changes focused on physician behavior, a hospital’s ability to improve quality might depend on its “willingness or ability to exert influence over physicians.”
While this study found changes in the behavior of hospitals in response to the P4P program, it does not demonstrate that the changes made by hospitals resulted in clinical quality improvements. Additionally, the combination of the BCBS P4P program and other quality improvement interventions that were occurring simultaneously (e.g., CMS P4R, Joint Commission quality improvement) may have created a tipping point for the hospitals to make the reported behavioral changes. This study does not include a control group, which means there is no way to determine whether hospitals not exposed to the BCBS of Michigan Hospital Incentive Program were making similar changes.
Sautter et al., 2007: This qualitative study described the findings of semi-structured interviews with senior management and cardiologists at 10 Michigan hospitals participating in the P4P program. Fifty-four hospitals that participated in the P4P program and reported cardiac care performance to BCBSM from 2002 to 2004 were placed into strata based on their changes in performance on one of the quality measures used in the incentive program, assessment of ventricular function among patients with CHF. Hospitals from each stratum were selected for interviews to obtain variation in hospital characteristics, such as size and teaching status. Among the 10 hospitals selected for interview, 7 had improved their performance, 2 were top performers at baseline and remained top performers, and 1 hospital showed declining performance. Only two of the 10 hospitals interviewed reported that the P4P incentives were a driver for quality improvement; eight of the 10 reported that their facilities were undertaking these activities anyway or that the incentive was not large enough to be effective. The authors, however, are not sure these responses imply that without financial incentives performance would have improved to the same degree. They note, “incentive rewards clearly enabled some hospitals to make investments in quality.” In explaining the variation in quality improvement, the authors believe “underperforming hospitals with some infrastructures for quality improvement had the greatest success when presented with incentives.”
CMS–Premier Hospital Quality Incentive Demonstration
Four studies have analyzed the effects of the PHQID, a three-year CMS-sponsored demonstration project initiated in 2003. The PHQID program allowed for voluntary enrollment (i.e., hospital self-selection into the study) and only included hospitals using the Premier Perspectives data system—two factors that may hinder the ability to generalize the experience of the demonstration hospitals to non-demonstration hospitals to the extent that participants differ in important ways from non-participants. It should also be noted that at the start of the Quality Incentive Demonstration period, CMS had already begun implementing its RHQDAPU P4R program, whose set of measures overlapped substantially with that of the PHQID. The PHQID program includes 34 measures of which 22 overlap with RHQDAPU measures in the areas of AMI, pneumonia, CHF, and surgical infection prevention.
The PHQID demonstration includes 262 hospitals across 38 states. Hospitals were paid an annual bonus based on their composite performance scores in five clinical areas: AMI, Coronary Artery Bypass Graft (CABG) surgery, Community Acquired Pneumonia (CAP), CHF, and hip and knee replacement surgery. The bonus dollars represented new money. Hospitals that did not achieve a minimum level of performance in the third year of the program (defined by the lowest two deciles of performance in the first year of the program) were assessed a financial penalty.
Premier, Inc., 2006: Premier published its own report describing the PHQID and the observed quality improvements from the first year of the incentive program’s implementation. Premier reported that between the first and fourth quarters of the first year of the program (October 2003 to September 2004), significant gains were made across the measures in the study, with an average 6.6 percentage point improvement across the five clinical areas. Within each of the five clinical composites, AMI performance increased from 87.4 percent to 90.8 percent, CABG surgery performance improved from 84.9 to 89.7 percent, CAP improved from 69.3 percent to 79.1 percent, CHF increased from 64.6 percent to 74.2 percent, and hip/knee replacement improved from 84.5 percent to 90.1 percent.
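As a quick check on the figures above, the reported composite scores can be turned into percentage point gains and averaged. The sketch below assumes a simple unweighted mean across the five clinical areas; the summary does not say how Premier computed its 6.6-point average, so the weighting is an assumption, and the variable names are illustrative.

```python
# First- and fourth-quarter composite scores (percent) for year 1,
# as reported in the text above.
composites = {
    "AMI": (87.4, 90.8),
    "CABG": (84.9, 89.7),
    "CAP": (69.3, 79.1),
    "CHF": (64.6, 74.2),
    "Hip/knee replacement": (84.5, 90.1),
}

# Percentage point gain per clinical area, rounded to one decimal place
# to avoid floating-point noise in the subtraction.
improvements = {
    area: round(fourth_q - first_q, 1)
    for area, (first_q, fourth_q) in composites.items()
}

# ASSUMPTION: unweighted mean across the five clinical areas.
average_improvement = sum(improvements.values()) / len(improvements)

for area, gain in improvements.items():
    print(f"{area}: +{gain} percentage points")
print(f"Average: +{average_improvement:.1f} percentage points")
```

The gains differ considerably by area, with CAP and CHF improving by nearly 10 points while AMI, which started from the highest baseline, improved by less than 4.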
Although these results are positive, it is difficult to draw conclusions from this study about the effect of the PHQID program. An important challenge with this study is trying to assess whether non-participants were achieving similar gains in performance given the absence of a control group. As documented by Williams et al. (2005), there has been a strong trend across the country toward improvement in many of the same measures used as a basis for incentives in the PHQID. Disentangling the impact of the CMS-Premier demonstration from concurrent Joint Commission and CMS quality improvement efforts (i.e., RHQDAPU and the 7th Scope of Work) requires that there be a set of comparison hospitals with similar characteristics but no exposure to the PHQID. Selection bias is another issue to contend with in explaining the observed outcomes, since Premier hospitals that chose to participate in the PHQID had higher baseline quality scores than did Premier hospitals that chose not to. Thus, improvements in performance may stem partly from characteristics of the hospitals that participated rather than from the incentive program itself.
Grossbart, 2006: This study examined the impact of the PHQID but focused solely on a subset of hospitals participating in the Premier system. The study followed the performance of hospitals in the Catholic Healthcare Partners system: four that chose to participate in the PHQID and six that chose not to participate and were used as controls. The analysis was limited to a subset of 17 of the 34 measures used in the PHQID initiative (for three clinical conditions: AMI, CAP, and CHF) that were collected by both intervention and control groups of hospitals as part of reporting for the Joint Commission ORYX Core Measures program.
All 10 hospitals showed significant improvement across the measures. Those participating in the PHQID had a statistically significantly greater increase in performance than did the non-participants. Across 17 measures, PHQID hospitals improved their scores by 9.3 percentage points, versus 6.7 percentage points for non-participating hospitals. Although the researchers matched hospitals on a number of key characteristics, one important limitation of this study is that they did not match them on baseline performance. The findings are confounded by the fact that the participating hospitals started at a higher level of quality than the non-participants did (80.4 percent versus 78.9 percent).
Much of the observed difference between the two sets of hospitals was driven by greater improvement in CHF care (19.2 percentage points for PHQID hospitals versus 10.9 percentage points for non-participants). Across the 17 measures examined, the two measures with substantial differences in improvement between PHQID and non-participating hospitals were (1) discharge instructions for patients with CHF (40.1 percentage points improvement for PHQID hospitals versus 14.6 for non-participants), and (2) pneumococcal vaccine delivery for patients admitted with pneumonia (31.6 percentage points improvement for PHQID hospitals versus 22.1 for non-participants). These two measures likely drive a substantial fraction of the overall observed differences in improvement between participating and non-participating hospitals.
The PHQID P4P intervention did not occur in isolation; it was conducted in an environment in which several national quality improvement efforts already in play were focusing on the same measures, particularly the HQA measures. These efforts included the CMS RHQDAPU program, the Joint Commission’s quality improvement initiatives, and the CMS 7th Scope of Work. Across the subset of ten HQA measures, the study found no statistically significant difference in the amount of improvement: 5.4 percentage points for PHQID hospitals and 5.1 percentage points for non-participating hospitals. This very modest difference raises questions about the added value of P4P incentives above and beyond other quality measurement and feedback efforts, particularly the RHQDAPU P4R intervention, which appears to have driven improvements in performance nationally (Lindenauer et al., 2007). Similar levels of improvement were observed among all hospitals nationally, both those exposed to P4P and those exposed to public reporting, measurement, and feedback interventions.
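The comparisons in the Grossbart study reduce to a difference in improvement between participating and non-participating hospitals. A minimal sketch of that subtraction, using the percentage point gains quoted above, is shown here. The function name is illustrative, and the calculation is an unadjusted difference only: it does not correct for baseline performance or hospital characteristics, which the text identifies as a key limitation.

```python
def differential_improvement(participant_gain: float, control_gain: float) -> float:
    """Gain among participants beyond the gain among comparison hospitals (percentage points)."""
    return round(participant_gain - control_gain, 1)


# Percentage point improvements quoted above for Grossbart (2006).
all_17_measures = differential_improvement(9.3, 6.7)
hqa_subset = differential_improvement(5.4, 5.1)

print(f"All 17 measures: {all_17_measures} points more improvement under PHQID")
print(f"Ten HQA measures: {hqa_subset} points more improvement under PHQID")
```

Seen side by side, the unadjusted advantage for PHQID hospitals shrinks from 2.6 points across all 17 measures to 0.3 points on the HQA subset that was also the focus of national reporting efforts.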
The author described why only some Catholic Healthcare Partners hospitals chose to participate in PHQID. With the exception of those with the highest volume, hospitals saw the costs of participation, particularly for the extra staff required for the additional data collection, as being too high; and most hospital CEOs believed there was little to be gained by participation. Those that chose to participate thought the experience would provide them with a market advantage and a head start given the growing numbers of P4P programs in the market.
It is unknown from this study whether the ten Catholic Healthcare Partners hospitals making up the set are similar to or different from other hospitals nationally in ways that are important. To the extent that these hospitals differ in important ways from other hospitals, the results may not be more broadly generalizable. Another unknown is how Catholic Healthcare Partners hospitals and the system in which they operate may differ from other hospitals nationally, such as in the amount and type of systems and quality resource support that were provided. The six hospitals serving as the control group were selected because of “similar levels of service,” and the hospitals were shown to be similar in terms of availability of an open heart program and average number of beds, discharges, and case-mix index. A more rigorous method of selecting controls would have been to match each intervention hospital to a control on these characteristics as well as on baseline performance.
Lindenauer et al., 2007: This study provides the most comprehensive evaluation of the impact of the PHQID that has been published to date. The paper describes changes in performance on 10 measures that occurred over a two-year period, between the fourth quarter of 2003 and the third quarter of 2005. The study examined 207 PHQID hospitals and 406 control hospitals that were submitting performance data as part of the RHQDAPU program. Hospitals in this study were matched on bed size, teaching status, region (Northeast, Midwest, South, or West), location (urban or rural), and ownership status (for-profit or not-for-profit).
On an overall composite measure constructed from the 10 measures, PHQID hospitals experienced greater improvement than the control hospitals did (9.6 percentage point improvement versus 5.2 percentage points). This difference was seen consistently for each of the three clinical conditions (AMI, CAP, and CHF), for most individual measures, and on an appropriate care measure.3 The greatest amount of improvement was seen among hospitals with the lowest baseline performance.
The authors conducted a number of sensitivity analyses to assess whether this differential response stemmed from a volunteer bias, meaning that Premier Perspectives hospitals that volunteered to select into the PHQID program were inherently different from Premier Perspectives hospitals that did not volunteer. The researchers found that after controlling for baseline performance and volume of patients, the difference in improvement decreased from 4.3 percentage points to 2.9 percentage points, but the improvement was still statistically significantly higher in PHQID hospitals. When all hospitals eligible to participate in the PHQID program were compared to all other hospitals nationally (i.e., those exposed to RHQDAPU), the performance differential remained, but the gap was smaller (the difference in absolute performance point improvement was 2.1 points). Overall, this article provides the strongest evidence that the PHQID is improving performance beyond what is accomplished by public reporting of performance for some of the 10 measures, albeit modestly, once the hospitals’ baseline performance and characteristics are controlled for. Because this study describes the impact of the P4P intervention on top of the measurement and public reporting intervention, we do not know how the impact of the P4P intervention would have differed absent public reporting.
Glickman et al., 2007: This study examined the impact of the PHQID on hospitals voluntarily participating in the national quality improvement initiative Can Rapid Risk Stratification of Unstable Angina Patients Suppress Adverse Outcomes with Early Implementation of the American College of Cardiology/American Heart Association (ACC/AHA) Guidelines (CRUSADE). Hospitals participating in CRUSADE received performance feedback, including comparisons with other CRUSADE hospitals and national standards, as well as a variety of educational interventions. Trends in the cardiac care of patients with non-ST-segment elevation AMI from July 2003 to June 2006 were compared for 54 CRUSADE hospitals participating in PHQID and 446 CRUSADE hospitals not participating in PHQID (i.e., controls). In addition to the AMI measures included in PHQID, the comparison also used eight AMI process measures not included in the demonstration. The study sought to determine whether participation in the P4P intervention gave an additional boost to performance improvement above that from the CRUSADE intervention.
Both PHQID and control hospitals improved performance on PHQID measures and the other AMI measures over the period examined. There were no statistically significant differences between improvement in the PHQID and control groups on the composite measure for either PHQID (7.2 percentage points and 5.6 percentage points, respectively) or other AMI measures (13.6 percentage points and 8.1 percentage points, respectively). PHQID hospitals had significantly greater improvement on three individual measures: two that were included in PHQID (aspirin prescribed at discharge, p = .04; smoking cessation counseling for active or recent smokers, p = .05) and one that was not included in the demonstration (lipid-lowering agent prescribed at discharge, p = .02). There were no statistically significant differences in improvements in inpatient mortality between the two groups. In both groups, hospitals with lower levels of performance at the start of the observation period demonstrated greater improvements in performance than did higher-performing hospitals.
The authors concluded that P4P leads to only very small improvements in performance beyond what can be accomplished through engagement in quality improvement initiatives. Like the Lindenauer et al. (2007) article, the Glickman et al. article demonstrates the importance of using control hospitals and controlling for baseline performance in any analysis of the impact of hospital P4P. This study’s limitations are its focus on only one of the clinical areas included in PHQID and its narrow focus on patients with non-ST-segment elevation myocardial infarction. In addition, since the hospitals included in the study voluntarily participated in CRUSADE, it is not known whether hospitals would demonstrate the same level of performance improvement if participation were not voluntary.
Summary of the Evidence on Hospital P4P Programs
As of June 2007, there were only nine studies on the impact of hospital P4P programs, one of which was not peer reviewed. All of these studies evaluated programs that targeted the inpatient setting, and none examined P4P interventions in the hospital outpatient setting. Among the studies examining changes in performance, each one reported improvements over time in at least some of the hospital performance measures or condition-specific composites included in the specific study; however, it is difficult to disentangle the P4P effect from the effect of other quality improvement efforts that were occurring simultaneously. Improvements in hospital performance have been observed in response to feedback reports (Williams et al., 2005) and public reporting with a financial incentive for submitting data (Grossbart, 2006; Lindenauer et al., 2007).
Two of the studies with control groups saw very modest improvements in performance associated with P4P compared with what was accomplished with public reporting (Grossbart, 2006; Lindenauer et al., 2007), and the third saw improvements in only a few performance areas associated with P4P compared with what was seen for control hospitals participating in voluntary quality improvement activities (Glickman et al., 2007). It has been argued, however, that to accomplish sustained quality improvement, interventions should be multifaceted and focus on different levels of the health care system (Grol et al., 2002; Grol and Grimshaw, 2003). This implies that to be most effective, P4P should be partnered with other activities, such as public reporting and internal quality improvement activities, that also encourage quality improvement for the same clinical area.
There is less evidence of the effect of P4P on patient outcomes. Berthiaume et al. (2006) found improvements in complication rates for obstetrical and surgical patients in an uncontrolled study but did not report whether those improvements were statistically significant. Glickman et al. (2007) did not find significant differences in inpatient mortality improvement for AMI between PHQID and control hospitals. None of the studies evaluating PHQID separately analyzed the other patient outcome measures (for coronary bypass surgery and hip and knee replacement surgery) included in the program, so it is not clear whether improvements occurred in these measures.
Most of the published studies have significant methodological limitations. Six of the nine had no controls, which are critical for providing evidence of a link between P4P and performance improvements. This is particularly important given the documented temporal trend toward increasing performance on many hospital quality metrics. It is challenging to disentangle the effects of the increasing use of financial incentives from the effects of greater use of quality improvement initiatives on the local and national level as well as the increasing use of public reporting when all activities are focused on the same clinical conditions. One of the studies that used a control group only included six control hospitals, and it is unclear whether the controls utilized were appropriate.
Beyond the specific limitations of the nine studies, another important issue is whether the experience of these geographically confined incentive programs, which took place in the context of established relationships between the individual hospitals and the program sponsors, would reflect the experience of wholesale national implementation of a hospital P4P program by Medicare. Medicare is the largest payer of inpatient care in the nation, accounting for 30.4 percent of third-party payments for hospital expenditures (CMS, 2007b). Given the importance of this revenue source for hospitals, it is possible that the level of engagement by hospitals in a national P4P program would be higher than that experienced in the programs in Michigan and Hawaii, although in both Hawaii and Michigan the incentive program was administered by the dominant commercial payer in each of those states. Another issue to consider when interpreting the impact of these smaller P4P programs and demonstrations is that they all generally focus on a small set of process measures covering a handful of diagnoses. It is unknown what the impact on raising quality performance more broadly might be if Medicare were to adopt a more comprehensive set of measures.
The published literature on the use of financial incentives in health care is sparse and provides little information about how specific design features may influence behavioral responses. P4P is common in industries other than health care, and economists and management experts have studied and developed theories on how individuals respond to financial incentives. In the sections that follow, we describe theories that are drawn from the economics and management literature and consider the implications of applying the findings from tests of these theories to the design of a P4P program. Our review is not exhaustive; instead, it focuses on selected theories to illustrate how theory might inform program design to achieve the desired behavior changes. The theories described have examined the behavioral responses of individuals, not institutions. It is thus uncertain whether application of these theories would elicit the same type of behavioral responses from organizations, such as hospitals.
Prospect Theory and the Role of Framing in Decisionmaking
P4P incentives are designed to change the behavior of providers and the systems in which they operate in ways that will improve quality or efficiency. Various factors, such as the size of the incentive, are likely to influence a hospital and its physicians’ behavioral responses to a P4P program. For example, a large incentive would likely lead to a larger behavioral response than would a small incentive. Another factor is how an incentive is structured, or “framed,” which can determine the behavioral response to it. Prospect theory is an economic theory that attempts to explain how individuals respond to the framing of choices (Kahneman and Tversky, 1979). What follows is a description of several applications of prospect theory and an exploration of the potential implications for structuring a P4P program.
Withholds May Have More of an Impact Than Bonuses
One aspect of prospect theory is the principle of “loss aversion,” which holds that individuals are more sensitive to incentives when they perceive they are losing as opposed to gaining something. This effect has also been described as “losses loom larger than gains.” This behavioral effect has been demonstrated in a series of experiments in which both doctors and patients were asked to make a choice of treatment (either surgery or radiation) for a patient with lung cancer. Both doctors and patients made different choices depending on whether the choice was framed as a loss (the probability of dying after surgery) or as a gain (the probability of surviving after surgery) (McNeil et al., 1982). In another experiment, Meyerowitz and Chaiken (1987) showed that a pamphlet that framed the benefits of breast self-examinations as a loss (lost ability to detect cancer early) led to a greater increase in the percentage of women doing these examinations than did an identical pamphlet that framed the benefits as a gain (gained ability to detect cancer early). The difference in the behavioral response for a choice framed as a loss rather than as a gain can be significant, almost twofold in magnitude (Kahneman and Tversky, 1979).
The principle of loss aversion may have implications for structuring a P4P incentive payment. Incentive payments can be structured as a withholding (a perceived loss in income) or as a bonus (a perceived gain). Under a withholding, for example, a portion of the hospital’s full payment for a service could be held back until the end of the measurement period and then released only if the hospital met the performance target. The theory of loss aversion suggests that if the goal is to drive hospitals to make changes that improve quality or efficiency, withholding dollars with the likelihood of later releasing them based on performance (i.e., framing the incentive as a possible loss) may lead to a greater behavioral response than framing the incentive as a “gain,” in the form of a bonus, even if the same amount of money is at risk.
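The asymmetry between a bonus and a withholding can be made concrete with a minimal numerical sketch. It assumes a simple piecewise-linear value function in which losses are weighted by a hypothetical loss-aversion coefficient of 2.0, chosen only to echo the "almost twofold" difference cited above; the function name and the dollar amount are illustrative and are not drawn from any P4P program.

```python
# Illustrative sketch of loss aversion. The coefficient 2.0 is an assumption,
# chosen to echo the "almost twofold" difference reported in the literature.
LOSS_AVERSION = 2.0


def perceived_value(change_in_dollars: float) -> float:
    """Return the subjective value of a gain (positive) or loss (negative)."""
    if change_in_dollars >= 0:
        return change_in_dollars
    # Losses are felt more strongly than equal-sized gains.
    return LOSS_AVERSION * change_in_dollars


amount_at_risk = 1000.0  # hypothetical incentive amount

bonus_framing = perceived_value(+amount_at_risk)     # earning a bonus
withhold_framing = perceived_value(-amount_at_risk)  # forfeiting a withhold

print(f"Perceived value of earning the bonus:   {bonus_framing:+.0f}")
print(f"Perceived value of losing the withhold: {withhold_framing:+.0f}")
```

Under these assumptions, the same $1,000 carries twice the perceived weight when it is framed as a potential loss, which is the mechanism by which a withholding might elicit a stronger response than a bonus of equal size.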
While framing something as a loss rather than a gain may result in a larger behavioral response, experiments have shown that doing so generally causes a negative reaction and violates what the parties exposed to the incentive believe to be fair. This point was illustrated in a study in which subjects were asked to respond to two decision scenarios. The economic impact of the two scenarios was the same, but one was framed as a loss, the other as a gain. In the first scenario, subjects were told that there was no inflation in the community and that employees were being asked to take a 7 percent wage cut (a loss). In the second scenario, subjects were told that there was 12 percent inflation and that employees were being given a 5 percent raise (a gain). The result in both of these decision scenarios was the same—employees would all experience a 7 percent reduction in net earnings—but the emotional response differed. A majority of subjects (62 percent) judged the first scenario to be unfair, whereas only 22 percent thought the second was unfair (Kahneman, Knetsch, and Thaler, 1986).
In terms of P4P program design, this research suggests that hospitals would be more likely to perceive a bonus in a positive light than they would a payment withholding, even if the net financial impact is the same. This conclusion is supported by a finding from a recent survey of 79 physician group leaders: When given a choice in the structure of a P4P program, 59 percent preferred a bonus, 24 percent preferred a withholding, and 17 percent felt they were the same (Mehrotra et al., 2007).
A Series of Small Incentives Might Lead to More Quality Improvement Than Would One Large Incentive
Why do people go across town to save $10 on a clock radio but not to save $10 on a large-screen TV? After all, the same amount of money can be saved in both cases.
The explanation for the difference in behavioral response in these two scenarios is called the principle of “diminishing marginal utility” (Loewenstein, 2001): the perceived value of a sum of money becomes progressively lower when associated with an increasingly larger sum of money. Thus, for example, an individual perceives the difference between $0 and $10 as being greater than the difference between $100 and $110, which is perceived as being greater than the difference between $200 and $210, and so on. This principle asserts that people tend to judge such gains or losses as changes from their current state of well-being (or reference point), rather than their final states (Kahneman and Tversky, 1979).
When we apply these findings to hospital P4P program design, it may be more psychologically motivating to provide smaller, more-frequent incentive payments than to provide a larger, lump-sum incentive payment. As an example, consider that a total of $1,000 in incentives is to be provided to a hospital based on its performance. According to the principle of diminishing marginal utility, the hospital’s behavioral response is likely to be greater if the $1,000 is divided into a number of payments—say, ten payments of $100 each—rather than paid as a lump sum. The reason for the greater motivation is that each $100 is perceived as a new $100 gain, capitalizing on the steepest portion of the utility function (the difference between $0 and $100), rather than simply as an addition to the previous gains (for example, from $500 to $600) (Thaler, 1985).
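The $1,000 example can be illustrated with a short calculation. The sketch below assumes a square-root utility function, which is only one of many possible concave forms and is not taken from the cited studies; it also assumes that each payment is evaluated as a new gain from a reference point of zero, as the argument above describes.

```python
import math


def utility(gain_in_dollars: float) -> float:
    """Concave utility of a gain measured from the current reference point.

    The square root is an assumed functional form, used only for illustration.
    """
    return math.sqrt(gain_in_dollars)


total_incentive = 1000.0
number_of_payments = 10
payment = total_incentive / number_of_payments

# One lump sum: a single gain of $1,000 evaluated from the reference point.
lump_sum_value = utility(total_incentive)

# Ten separate payments: each $100 is evaluated as a fresh gain from zero.
split_value = number_of_payments * utility(payment)

print(f"Lump sum perceived value:     {lump_sum_value:.1f}")
print(f"Ten payments perceived value: {split_value:.1f}")
```

With any strictly concave utility function of this kind, the sum of the separately evaluated payments exceeds the value of the single lump sum; the size of the gap depends entirely on the assumed curvature.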
One way to structure this type of incentive in a P4P program would be to link the incentive payment to each applicable hospitalization. For example, the hospital could receive an extra payment of $100, on top of its usual DRG payment, for every patient admitted for pneumonia that received the care designated by the quality measure(s). This approach could lead to a greater behavioral change by the hospital than if it were to receive a lump sum, equal in dollar value, at the end of the year.
Uncertainty May Reduce the Behavioral Response
When given a choice, most people are risk averse; they will choose an option with 100 percent certainty over an option involving an uncertain but likely more valuable outcome. This principle of risk aversion is illustrated in a study in which subjects were given a choice between a one-week vacation that was certain or a three-week vacation they had a 50 percent chance of winning. The vast majority of subjects chose the one-week vacation (Kahneman and Tversky, 1979). Even though the 50 percent chance of a three-week vacation has the higher expected value (1.5 weeks versus 1 week) and might therefore be considered the more rational choice, most people will choose the sure thing because they perceive it to be a better choice than the possibility of getting nothing at all.
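The vacation choice can be worked through numerically. The sketch below compares the two options first by expected number of weeks and then by expected utility, assuming a square-root utility function; that functional form is an illustrative assumption and is not part of the cited study.

```python
import math


def utility(weeks: float) -> float:
    """Assumed concave utility of vacation time (square root, for illustration)."""
    return math.sqrt(weeks)


# Option A: a certain one-week vacation.
expected_weeks_certain = 1.0
expected_utility_certain = utility(1.0)

# Option B: a 50 percent chance of a three-week vacation, otherwise nothing.
probability_of_winning = 0.5
expected_weeks_gamble = probability_of_winning * 3.0
expected_utility_gamble = (
    probability_of_winning * utility(3.0)
    + (1 - probability_of_winning) * utility(0.0)
)

print(f"Expected weeks:   certain={expected_weeks_certain:.2f}, "
      f"gamble={expected_weeks_gamble:.2f}")
print(f"Expected utility: certain={expected_utility_certain:.3f}, "
      f"gamble={expected_utility_gamble:.3f}")
```

Under this assumed utility function the gamble offers more weeks on average but less expected utility, which is consistent with most subjects choosing the certain option.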
With regard to P4P program design, the principle of risk aversion suggests that decreasing the risk or uncertainty in the likelihood of receiving a financial incentive is likely to lead to a greater behavioral response to the incentive. Some P4P payment structures use relative thresholds, such as paying those in the top quartile of performance, as the basis for determining who “wins.” This type of payout scheme creates greater uncertainty for hospitals than do payment schemes that use absolute thresholds (i.e., a fixed target) for determining who receives an incentive payment. The reason for the greater uncertainty with relative thresholds is that the level of performance necessary to earn the incentive is unknown until after the fact, when hospitals can be sorted by rank order of performance. In contrast, absolute thresholds are known in advance and thus provide greater certainty to the individual or institution trying to hit the target. Because of the uncertainty they create, relative thresholds may reduce the behavioral response to an incentive more than an approach using an absolute threshold will. Similarly, a shared savings program, such as is being used in the CMS Physician Group Practice (PGP) demonstration, might lead to a reduced behavioral response, in this instance because the providers in the PGP face uncertainty about whether there will be cost savings to fund incentive payments. In contrast, the most certain incentive would be an adjustment to the fee schedule. For example, for every admission for myocardial infarction, a hospital would receive an extra $100, on top of its DRG payment, if the patient received all applicable processes of care. In such an incentive system, the hospital would know that if its physicians provide these processes, it would definitely obtain the additional payment.
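The difference in certainty between absolute and relative thresholds can be sketched with hypothetical data. In the example below, the hospital names, scores, and the 80 percent target are all invented for illustration; hospital "A" performs identically in both scenarios, and only its peers' scores change.

```python
def absolute_winners(scores: dict[str, float], target: float) -> set[str]:
    """Hospitals that meet or exceed a fixed target known in advance."""
    return {name for name, score in scores.items() if score >= target}


def top_quartile_winners(scores: dict[str, float]) -> set[str]:
    """Hospitals ranked in the top quarter; the cutoff depends on peers' scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    number_of_winners = max(1, len(ranked) // 4)
    return set(ranked[:number_of_winners])


# Hypothetical performance scores (percent). Hospital A scores 85 in both cases.
scenario_1 = {"A": 85, "B": 70, "C": 65, "D": 60, "E": 78, "F": 72, "G": 55, "H": 82}
scenario_2 = {"A": 85, "B": 90, "C": 88, "D": 60, "E": 78, "F": 72, "G": 55, "H": 82}

for label, scores in (("Scenario 1", scenario_1), ("Scenario 2", scenario_2)):
    print(label)
    print("  absolute (>= 80):", sorted(absolute_winners(scores, 80)))
    print("  top quartile:    ", sorted(top_quartile_winners(scores)))
```

Hospital A earns the incentive under the absolute threshold in both scenarios, but under the relative threshold it wins only when its peers score lower, even though its own performance is unchanged. This sketch ignores ties at the cutoff, which a real program would need to resolve.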
Reducing the Time Lags Between Performance and Receipt of Incentive Can Help to Achieve Maximum Response
In economics, the principle of discounting is based on the fact that individuals value having a sum of money now more than sometime in the future, even after accounting for inflation. The concept of discounting and the use of a discount rate are well accepted in both accounting and economics. Studies have found, however, that individuals discount differently than classic economic theory would predict. In one study, the vast majority of individuals chose to receive $10 immediately rather than $21 in one year (Loewenstein and Prelec, 1992). But when asked to choose between $10 in one year and $21 in two years, fewer individuals selected the $10. Instead of discounting at a constant rate, as classic theory assumes, the individuals in these experiments discounted along a steeper, hyperbolic curve, which led to the name of this phenomenon: hyperbolic discounting.
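The preference reversal in the $10 versus $21 example can be reproduced with a brief sketch. It uses the common hyperbolic form, value = amount / (1 + k × delay), alongside constant-rate (exponential) discounting; the parameter values k = 1.5 and an annual discount factor of 0.4 are hypothetical, chosen only so that both models prefer the immediate $10 in the first choice.

```python
def hyperbolic_value(amount: float, delay_years: float, k: float) -> float:
    """Present value under hyperbolic discounting: amount / (1 + k * delay)."""
    return amount / (1.0 + k * delay_years)


def exponential_value(amount: float, delay_years: float, factor: float) -> float:
    """Present value under constant-rate discounting: amount * factor ** delay."""
    return amount * factor ** delay_years


def preferred(value_small: float, value_large: float) -> str:
    """Name the option with the higher present value."""
    return "$10 sooner" if value_small > value_large else "$21 later"


K = 1.5              # hypothetical hyperbolic parameter
ANNUAL_FACTOR = 0.4  # hypothetical exponential discount factor

# Choice 1: $10 now versus $21 in one year.
# Choice 2: $10 in one year versus $21 in two years.
for label, start in (("Choice 1", 0), ("Choice 2", 1)):
    hyp = preferred(hyperbolic_value(10, start, K),
                    hyperbolic_value(21, start + 1, K))
    exp = preferred(exponential_value(10, start, ANNUAL_FACTOR),
                    exponential_value(21, start + 1, ANNUAL_FACTOR))
    print(f"{label}: hyperbolic prefers {hyp}; exponential prefers {exp}")
```

Under constant-rate discounting the choice is the same in both cases, because shifting both options by a year scales them by the same factor. Under the hyperbolic form the preference switches to the larger, later amount once both options are delayed, matching the pattern described above.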
The application of hyperbolic discounting to P4P program design suggests that minimizing the lag time between the performance being incentivized and receipt of the incentive may strengthen the behavioral response. Money received right away is perceived as different in value from money to be received in the future, even the near future. For example, a hospital is more likely to implement an electronic medical record (EMR) if it knows the money associated with doing so will be received quickly (e.g., within the next month) rather than years after the implementation. One criticism of current performance measurement and reporting programs is that the substantial lag between the provision of care (i.e., performance) and the reporting of results renders the results not actionable (Davies, 2001). Similarly, in a P4P program, the time required to collect and validate data and make the payout determination might mean that the incentive payment comes long after actual delivery of care. Substantial time lags may cause a hospital to see the incentive as occurring so far in the future that it is not worth pursuing. Strategies that tie payment to the provision of individual services or more frequent payouts may help reduce the time lag.
A Series of Tiered Absolute Thresholds May Be Better Than One Absolute Threshold
An individual’s motivation and effort when faced with a goal greatly depend on that individual’s baseline performance. Economists and psychologists have described this phenomenon as a “goal gradient” (Heath, Larrick, and Wu, 1999). If baseline performance is far away from goal performance, the individual exerts little effort, because the goal is viewed as not immediately attainable. As baseline performance gets closer and closer to goal performance, the individual exerts more and more effort to succeed. However, as soon as the goal is achieved, the motivation to improve decreases significantly. This phenomenon was illustrated in a study of a coffee shop reward program in which the tenth coffee purchased was free. Participants in this experiment slowly decreased the time between purchases of a coffee as they got closer to the free coffee (Kivetz, Urminsky, and Zheng, 2006).
The notion of a goal gradient may have application in structuring a hospital P4P program. This principle implies that there would be a greater behavioral response among hospitals if there were a series of quality performance thresholds to meet (e.g., increasing dollar amounts for achieving a 50 percent, a 60 percent, a 70 percent, an 80 percent, and a 90 percent performance threshold) rather than one (e.g., a 75 percent performance threshold). If, for example, there is just one 75 percent quality threshold (rather than a series of thresholds), a hospital whose baseline performance is at 45 percent is likely to see the goal as too difficult and not likely to be achieved, and is thus less likely to devote resources to quality improvement. If there is also a 50 percent quality threshold, however, the hospital’s leadership may see reaching the threshold as feasible and thus be more likely to devote resources to improving quality. A series of quality thresholds might also lead to a different behavioral response among hospitals that are doing well. In a single-threshold system with a goal of 75 percent, a hospital that is at 80 percent would have little reason to devote more resources to improve its quality performance any further. In a graded performance threshold system, however, this hospital would have an incentive to reach the highest threshold, 90 percent, to achieve additional payment. To stimulate continual improvement, some P4P programs have elected to use relative performance targets so that the bar keeps moving upward. However, absent some gradients or some allowance for payment along the entire continuum of improvement, a single relative threshold creates a cliff effect, in which payment is all or nothing.
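The two hospitals in this example can be traced through a small sketch that finds the next unmet threshold under each design. The function name is hypothetical; the threshold values and baselines are those used in the paragraph above.

```python
from typing import Optional


def next_threshold(baseline: float, thresholds: list[float]) -> Optional[float]:
    """Return the lowest threshold strictly above baseline, or None if all are met."""
    higher = [t for t in sorted(thresholds) if t > baseline]
    return higher[0] if higher else None


single = [75.0]
tiered = [50.0, 60.0, 70.0, 80.0, 90.0]

for baseline in (45.0, 80.0):
    for label, thresholds in (("single", single), ("tiered", tiered)):
        target = next_threshold(baseline, thresholds)
        if target is None:
            print(f"Baseline {baseline:.0f}%, {label}: no further threshold to reach")
        else:
            gap = target - baseline
            print(f"Baseline {baseline:.0f}%, {label}: next target {target:.0f}% "
                  f"({gap:.0f} points away)")
```

The hospital at 45 percent faces a 30-point gap under the single threshold but only a 5-point gap under the tiered design, while the hospital at 80 percent has no remaining target under the single threshold and a 10-point gap to the 90 percent tier.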
Multidimensional Output
Multidimensional output, or multitasking, refers to situations in which the responsibilities of an individual encompass multiple activities or outputs that may require different types of skills to accomplish (Holmstrom and Milgrom, 1991). A hospital’s output includes many different components, such as managing a patient’s chronic illness, the timely and efficient diagnosis of a patient’s new symptom, counseling and advice on how to prevent illness, and emotional support.
Multitasking is relevant to P4P programs because the performance measures in these programs typically address only a narrow piece of a hospital’s outputs or the processes that contribute to outputs. For example, a program may measure the provision of aspirin for a patient with AMI but not other processes or outputs that are difficult to measure, such as diagnostic acumen for a patient hospitalized with unclear symptoms. It is hypothesized that if a large incentive is applied to one type of output, other outputs will be neglected, and overall care might worsen (Holmstrom and Milgrom, 1991). This reasoning is used to explain why few private-sector corporations put large fractions of employee pay “at risk,” making them dependent on measures of output for which only a small fraction of what contributes to output can be measured (Asch and Warner, 1996). A large financial incentive based on a narrowly focused set of measures may lead to the unintended consequence of having a hospital “teach to the test,” devoting resources to those things being measured and neglecting other important outputs that are not being measured.
There are several potential ways to overcome or minimize the problem of multitasking. One is to create an incentive program that addresses a broad array of a hospital’s outputs by applying a comprehensive set of performance measures. This approach has been taken by the primary care physician P4P incentive program in the United Kingdom, which has 146 quality indicators covering clinical care for ten chronic diseases, organization of care, and patient experience (Doran et al., 2006). The challenge with this approach is to avoid creating a program that, absent efficient measurement tools, is overly complicated and costly. Another approach that employers in other industries have used is low-powered incentives (Asch and Warner, 1996). With this approach, the majority of an employee’s income is fixed, and only a small fraction is tied to an incentive. The incentive emphasizes the importance of the measured area but is not large enough to induce undesirable behaviors, such as gaming of the data to win or avoiding caring for sicker patients.
Intrinsic Versus Extrinsic Motivation
Meta-analyses of empirical studies of incentive programs show mixed results; some studies show an impact, and many others show little or even a negative impact (Rothe, 1970; Deci, Koestner, and Ryan, 1999; Cameron, Banko, and Pierce, 2001). Researchers have tried to reconcile the mixed results by theorizing that they are caused by a conflict between intrinsic motivation, which is a person’s inherent desire to do a task, and extrinsic motivation, which is driven by an external incentive such as might be provided in a P4P program. Researchers theorize that instead of supporting intrinsic motivation, an extrinsic incentive “crowds out” intrinsic motivation (Deci, Koestner, and Ryan, 1999). This theory is used to explain why financial incentives for blood donation are ineffective: they inhibit the altruistic benefit of blood donation (Titmuss, 1970). The explanation for this crowding-out effect is that when a task is tied to an extrinsic incentive, people infer that the task is difficult or unpleasant (Freedman, Cunningham, and Krismer, 1992).
Empirical evidence of this effect was provided by a study in which students who were asked to collect money for a charity were put into two groups, one that was given an external incentive (a small amount of money), and one that was not. The group that was given the incentive collected less money than the other group did (Gneezy and Rustichini, 2000). A meta-analysis supported this study’s finding that performance-contingent rewards significantly undermine intrinsic motivation (Deci, Koestner, and Ryan, 1999), but the finding is not without critics (Cameron, Banko, and Pierce, 2001). Similar concerns have been raised about the effect of P4P in health care and how it may violate a physician’s sense of professionalism (Berwick, 1995). Application of this theory would imply that a small P4P incentive could actually lead to lower performance if it is tied to something hospitals are intrinsically motivated to improve, such as quality of care.
A potential way to address the crowding out of intrinsic motivation is simply to increase the size of the financial incentive. A very large external incentive will crowd out any inherent intrinsic motivation; but, in turn, it may create a greater behavioral response than would be obtained through intrinsic motivation alone. Gneezy and Rustichini, in “Pay Enough or Don’t Pay at All” (2000), illustrated this concept in a study of the average percentage of correct answers on an IQ test for fo