Increasing Organ Donation and Transplantation: The Challenge of Evaluation. II. Evaluation Methodology


Despite the diversity of design, costs, and other factors, the aim of evaluation methods in use today is essentially the same, i.e., to assess the effect of an intervention on one group compared to the effect of a different intervention (or no intervention) on another group. By definition, all evaluations have a control or comparison group. Exhibit 2 depicts a basic framework for considering the methodological rigor of evaluation types, their respective study elements, and examples of organ donation activity evaluations.

The evaluation types in Exhibit 2 are listed in rough order of most to least scientifically rigorous for internal validity (i.e., for accurately representing the causal relationship between an intervention and an outcome in the particular circumstances of a study). This ordering of methods assumes that each study is properly designed and conducted; a poorly conducted large RCT may yield weaker findings than a well-conducted study that is lower on the design hierarchy. This list is representative; there are other variations of these methodologic designs, and some investigators use different terminology for certain methods (Appendix A contains definitions of the evaluation types listed in Exhibit 2).

As Exhibit 2 depicts, every evaluation has strengths and weaknesses. There are typically trade-offs involving rigor, cost, and feasibility. The importance of these trade-offs depends largely on the activity under study and the goals and resources of the organization conducting the evaluation. For example, it is possible to design a prospective study to evaluate the effect of a national media campaign on actual organ donation rates. However, tracking millions of people over time to capture what may be small differences in donation rates might require more time and resources (e.g., money) than is feasible. A prospective evaluation design may more appropriately be used to assess post-event activities because fewer people (i.e., only those who become potential donors) have to be tracked over a shorter period of time to capture a change in the donation rate. For example, a prospective study might feasibly assess the effect on consent rates of decoupling the discussion of organ donation from the announcement of brain death.

Given resource and time constraints, and the difficulties associated with randomizing and perfectly controlling "real-world" studies, it may not be possible, and is often not practical, to conduct a randomized controlled study. However, all evaluations can include elements that strengthen the methodology of the study and produce more rigorous results. For example, a more valid comparison group often can improve study designs. In a time-series study, one group (e.g., a hospital) is measured at baseline (e.g., consent rate, donation rate), subjected to an intervention (e.g., an in-service provider education program), and re-measured at several intervals to assess changes in performance indicators. In this design the group serves as its own control. A more rigorous control would be an external control, for example a hospital not receiving the in-service program. Because the control hospital is subject to the same external influences as the study hospital (e.g., a concurrent mass media campaign), the effects of the intervention can be measured more accurately. Examples from the behavioral literature provide insight on how to select and randomize control groups for "health behavior interventions" similar to changing personal behavior with regard to organ donation (Appendix C).

Exhibit 2: General Strengths and Weaknesses of Evaluation Types

Source: Lewin, 1998

The following general guidelines are helpful for weighing the relative rigor of alternative types of controlled studies.

  • Randomized studies are stronger than non-randomized studies.

Randomized studies require the assignment of subjects to intervention and control groups based on a chance distribution. This technique is used to diminish subject selection bias in controlled studies.
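
As a minimal sketch of chance-based assignment, the Python example below shuffles a list of study units and splits it into two groups. The hospital names, the group labels, and the `randomize` helper are hypothetical and serve only as illustration; a real study would follow a documented randomization protocol.

```python
import random


def randomize(subjects, seed=None):
    """Assign subjects to intervention and control groups by chance.

    A fixed seed makes the assignment reproducible for auditing.
    """
    rng = random.Random(seed)
    shuffled = list(subjects)   # copy so the caller's list is unchanged
    rng.shuffle(shuffled)       # chance ordering, not investigator choice
    half = len(shuffled) // 2
    return {"intervention": shuffled[:half], "control": shuffled[half:]}


# Hypothetical study units: 20 hospitals identified only by number.
hospitals = [f"hospital_{i:02d}" for i in range(1, 21)]
groups = randomize(hospitals, seed=1998)
```

Because the ordering is left to chance, investigators cannot steer units that are already likely to perform well into the intervention group, which is the selection bias that randomization is meant to diminish.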

  • Prospective studies are stronger than retrospective studies.

In a prospective study, the investigators define the study groups and the outcomes to be measured before the intervention takes place, then follow the subjects forward in time and analyze the outcomes. In a retrospective study, investigators select groups of subjects who have already been subject to an intervention and analyze how the intervention relates to the outcomes.

  • Large studies are stronger than small studies.

The sample size of a study should be large enough to have an acceptable probability of detecting a difference in outcomes, if such a difference truly exists, between the experimental and control groups attributable to the intervention being evaluated. Although larger studies increase the statistical power of the evaluation, there is a point beyond which there are diminishing returns and studies may become unnecessarily costly and inefficient.
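
The relationship between sample size and the probability of detecting a difference can be sketched with a standard normal-approximation formula for comparing two proportions. This is one common textbook approach, not a method prescribed by this report, and the consent rates used (50 percent in the control group, 60 percent in the intervention group) are assumed values chosen only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def sample_size_per_group(p_control, p_intervention, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-sided test
    comparing two proportions (normal approximation)."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_beta = _Z.inv_cdf(power)
    variance = (p_control * (1 - p_control)
                + p_intervention * (1 - p_intervention))
    effect = p_intervention - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def approximate_power(n_per_group, p_control, p_intervention, alpha=0.05):
    """Approximate probability of detecting the difference with
    n_per_group subjects in each group."""
    variance = (p_control * (1 - p_control)
                + p_intervention * (1 - p_intervention))
    standard_error = sqrt(variance / n_per_group)
    effect = abs(p_intervention - p_control)
    return _Z.cdf(effect / standard_error - _Z.inv_cdf(1 - alpha / 2))


# Assumed consent rates, for illustration only.
needed = sample_size_per_group(0.50, 0.60)

# Each doubling of the sample adds less power than the one before it.
for n in (100, 200, 400, 800, 1600):
    print(n, round(approximate_power(n, 0.50, 0.60), 3))
```

The loop illustrates the diminishing returns described above: once power is already high, further increases in sample size add cost but little additional ability to detect the difference.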

  • Contemporaneous controls are stronger than historical ones.

A contemporaneous control group exists when the results of an intervention group and a control group are compared over the same time period. An historical control group exists when the results of an intervention group are compared with the results of a control group observed at some previous time.

  • External controls (multiple-group designs) are stronger than self-controls (one-group designs).

A multiple-group design exists when comparisons are made between one group receiving the intervention and one group not receiving the intervention (control). A one-group design exists when the experience of a single group is compared before (control) and after an intervention.
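
The difference between the two designs can be illustrated with a small arithmetic sketch in Python. All consent rates below are hypothetical, and the "difference in differences" calculation is one simple way of using an external control, offered here as an assumption rather than as the method used in any study cited in this report.

```python
def before_after_change(baseline, follow_up):
    """One-group design: the group's own baseline serves as the control."""
    return follow_up - baseline


def difference_in_differences(study, control):
    """Multiple-group design: subtract the change seen in the external
    control, which reflects influences common to both groups."""
    return (before_after_change(study["baseline"], study["follow_up"])
            - before_after_change(control["baseline"], control["follow_up"]))


# Hypothetical consent rates. Both hospitals are exposed to the same
# concurrent mass media campaign; only the study hospital receives
# the in-service education program.
study_hospital = {"baseline": 0.45, "follow_up": 0.57}
control_hospital = {"baseline": 0.46, "follow_up": 0.51}

one_group_estimate = before_after_change(
    study_hospital["baseline"], study_hospital["follow_up"])
two_group_estimate = difference_in_differences(
    study_hospital, control_hospital)

print(f"one-group estimate: {one_group_estimate:.2f}")
print(f"two-group estimate: {two_group_estimate:.2f}")
```

In this invented example the one-group estimate attributes the entire change at the study hospital to the program, while the two-group estimate removes the portion that also appeared at the control hospital and was therefore more likely due to shared outside influences.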