The Evaluation of Abstinence Education Programs Funded Under Title V Section 510: Interim Report. Scientific Rigor in the Study Design


The scientific rigor of the impact study design rests on four key elements.  First, it relies on the selection of strong, well-implemented, replicable program models.  Second, the impact study uses a rigorous experimental design to create program and control groups within each site.  Third, the sample enrollment period was long enough to generate adequate sample sizes to support reliable impact estimates.  Finally, the impact evaluation includes a follow-up period of reasonable length to ensure that relevant changes in behavioral outcomes can be measured.

The impact evaluation examines five programmatic strategies geared to the needs of the local communities (Table 3).  Measuring impacts for a range of program models promotes the goal of identifying and documenting abstinence education strategies appropriate to varied local needs and contexts.  For example, the Florida and Wisconsin programs serve mainly youth from single-parent households; these programs are intensive and include strong components on relationship development and maintenance, as well as understanding and appreciation for the institution of marriage.  In the Mississippi program site, many youth live in large, multigenerational households isolated from the broader community.  The program in this community is delivered through the schools and emphasizes both basic knowledge development and components focused on managing peer pressure.  Youth in the South Carolina and Virginia programs live in communities that mirror “middle America.”  The program in Virginia is a low-cost, school-based intervention, while the one in South Carolina is a more comprehensive and intensive youth development initiative.  These choices of program strategies reflect community characteristics and perceptions of how best to serve youth, given local needs and the resources and constraints of the partner schools.

One implication of the variation in program interventions and services is that it is not possible to reach a single judgment about the efficacy of abstinence education.  Such a judgment would only be possible if there were a single, well-defined intervention, one that could vary in its “dosage” across sites but is similar in nature across all sites.  In the case of the Section 510 abstinence education programs, however, the interventions and services vary considerably across program sites and sometimes even within a program site.  In the absence of definitive evidence on the efficacy of a specific abstinence education approach, this variation is a natural result of the funding opportunities available through Title V Section 510.  In addition, the variation in the abstinence education programs provides the opportunity to learn about the effectiveness of different programmatic strategies.

Table 3: Program Interventions and Services Received by the Control Groups

Program Location | Program Intervention | Control Group Services
FL (Miami) | Elective class offered daily, all year to girls in middle schools (ReCapturing the Vision and Vessels of Honor) | Other elective class
MS (Clarksdale) | Mandatory weekly year-long abstinence education curriculum (Revised Postponing Sexual Involvement and Sex Can Wait) | Regular health class
SC (Edgefield) | Five-session mandatory curriculum with voluntary enrollment in weekly or biweekly character clubs (Heritage Keepers) | Five-session mandatory abstinence curriculum without character clubs
VA (Powhatan) | 36-session mandatory curriculum (Reasonable Reasons to Wait; The Art of Loving Well; and Choosing the Best) | Regular health class
WI (Milwaukee) | Voluntary after-school program; two hours daily all year for multiple years (Families United to Prevent Teen Pregnancy) | Regular after-school programs; no special services

The impact evaluation uses an experimental design.  In an experimental design study, program slots are filled by youth who are selected at random from a larger pool of eligible and appropriate youth (Figure 2).  Random assignment procedures divide youth into a program group that has access to the abstinence education program and a control group that does not receive the program, but may receive regular or alternative services.  The contrast in services being studied varies depending both on the nature and intensity of the program services and the experiences of the control group (see Table 3).
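The random assignment step described above can be sketched in a few lines of Python.  This is a minimal illustration, not the evaluation team's actual procedure: the function name `assign_groups`, the `program_slots` parameter, the youth IDs, and the seed are all hypothetical, and the group sizes are borrowed from the largest site only to make the example concrete.

```python
import random


def assign_groups(eligible_ids, program_slots, seed=None):
    """Randomly split eligible youth into a program group and a control group.

    program_slots (hypothetical parameter) is the number of program places
    available; everyone not drawn for a slot forms the control group.
    """
    if program_slots > len(eligible_ids):
        raise ValueError("more program slots than eligible youth")
    rng = random.Random(seed)  # seeded so the draw can be reproduced and audited
    shuffled = list(eligible_ids)
    rng.shuffle(shuffled)
    program = sorted(shuffled[:program_slots])
    control = sorted(shuffled[program_slots:])
    return program, control


# Illustrative pool of 700 made-up IDs with 371 program slots
youth_ids = [f"Y{i:03d}" for i in range(700)]
program_group, control_group = assign_groups(youth_ids, program_slots=371, seed=510)
print(len(program_group), len(control_group))  # prints: 371 329
```

Because every eligible youth has the same chance of being drawn, the two groups should differ systematically only in their access to the program.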

Figure 2. Study Sample Enrollment and Tracking

Longitudinal tracking of both the program and control group youth begins at the time of sample enrollment and continues for 18 to 36 months, depending on the time of initial enrollment.  The comparison of outcomes for these two groups over time provides the basis for judging impacts of the program.
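As a rough illustration of how such a comparison yields an impact estimate, the sketch below computes the difference in an outcome proportion between the two groups, with a normal-approximation confidence interval.  The function name and the counts are invented for illustration and are not study results; the report does not state which estimator the evaluation uses, so this simple difference in proportions is an assumption.

```python
import math


def impact_estimate(program_events, program_n, control_events, control_n, z=1.96):
    """Program-minus-control difference in proportions with an approximate 95% CI."""
    p_prog = program_events / program_n
    p_ctrl = control_events / control_n
    diff = p_prog - p_ctrl
    # Standard error of a difference between two independent proportions
    se = math.sqrt(p_prog * (1 - p_prog) / program_n
                   + p_ctrl * (1 - p_ctrl) / control_n)
    return diff, (diff - z * se, diff + z * se)


# INVENTED counts: 100 of 500 program youth and 125 of 500 control youth
# report the outcome at follow-up.
diff, (ci_low, ci_high) = impact_estimate(100, 500, 125, 500)
print(f"estimated impact: {100 * diff:+.1f} percentage points")
print(f"approximate 95% CI: {100 * ci_low:+.1f} to {100 * ci_high:+.1f}")
```

A negative difference indicates that the outcome is less common in the program group; an interval that includes zero means the data cannot rule out no impact.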

The experimental design offers the best means of measuring, with a known degree of certainty, how successful the programs are overall and how well they serve key subgroups of youth within a site.  This is because, with careful implementation, the only systematic difference between the program and control youth should be their access to the program.  As a result of the random assignment, the program and control groups have similar demographic and background characteristics within any study site (Figure 3) and they are exposed to a common school and community context.

Figure 3. Demographic and Background Characteristics are Similar for Program and Control Youth Within Each Site.

However, the characteristics of sample youth vary across study sites due to a combination of factors, including program targeting practices and differences in the program and community characteristics.  For example, the average age of youth at the time of sample enrollment ranges from 10 in the Wisconsin program site, which delivers its services through an after-school program, to 13 in the Virginia program site, which serves exclusively eighth graders.  The proportion of sample youth who are non-Hispanic black ranges from a low of 12 percent in the Virginia program site to over 80 percent in two other programs, one of which operates in a rural southern community, the other in an inner-city setting.  The proportion living in two-parent families ranges from 37 percent to more than 75 percent.

Random assignment generates, in each study site, program and control groups consisting of youth who, on average, are subject to similar family rules and express similar attitudes and values about abstinence before the program group is exposed to abstinence education services (Figure 4).  For example, the proportion of youth who say their parents have strict rules about companions they spend time with varies across sites between 15 and 45 percent, but is similar for program and control youth within each site.  Between 62 and 83 percent of sample youth in each study site reported believing that “having sex as an unmarried teen would make it harder to subsequently have a good marriage,” and between 16 and 35 percent hold the view that “having sex is a way to tell someone you love them.”  In all cases, however, the views of program and control youth are nearly identical within each site.

Figure 4. Family Rules and Attitudes about Teen Sex are Similar for Program and Control Youth at Sample Enrollment.

A major advantage of the random assignment design is that it protects against selection bias in the impact estimates for the individual programs studied.  Other evaluation designs are vulnerable to selection bias, which can seriously undermine the credibility of their results.  Some evaluations, for example, have relied on comparisons of outcomes for participants in “elective” programs and youth at the same site who, for some reason, do not participate.  Others compare outcomes for program youth with youth who responded to local or national surveys.  In both cases, there is a strong possibility that the participants differ in some preexisting but unobservable way from the comparison group.  These preexisting differences may lead to biased estimates of program impacts.

Pre-post comparison designs have other defects.  Comparisons of measures for participant groups before and after their involvement in a program can be affected not only by the program but also by natural maturation effects.  For example, data from the National Longitudinal Survey of Adolescent Health show that the percentage of teens who have ever had sex increases from 9.6 percent at age 13 to 19.6 percent at age 14.  Thus, using a pre-post design to measure program impacts on abstinence would seriously bias the results toward estimates of no impacts or possibly even adverse impacts.
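The arithmetic behind this bias can be made explicit.  The sketch below uses the two prevalence figures quoted above and a purely hypothetical program that lowers prevalence at age 14 by 3 percentage points; the size of that assumed impact is invented for illustration only.

```python
# Figures quoted in the text: percentage of teens who have ever had sex, by age
PREVALENCE_AGE_13 = 9.6
PREVALENCE_AGE_14 = 19.6

# HYPOTHETICAL true impact: the program lowers age-14 prevalence by 3 points
ASSUMED_TRUE_IMPACT = -3.0

participants_before = PREVALENCE_AGE_13                       # measured at age 13
participants_after = PREVALENCE_AGE_14 + ASSUMED_TRUE_IMPACT  # measured at age 14
control_after = PREVALENCE_AGE_14                             # same age, no program

# A pre-post design compares participants with their younger selves, so the
# natural increase with age is mistaken for a program effect.
pre_post_estimate = participants_after - participants_before

# An experimental design compares participants with same-age control youth.
experimental_estimate = participants_after - control_after

print(round(pre_post_estimate, 1), round(experimental_estimate, 1))  # prints: 7.0 -3.0
```

Under these assumptions the pre-post comparison shows a 7-point increase in sexual experience, which would look like an adverse impact, while the program-control comparison recovers the assumed 3-point reduction.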

Studies that rely on comparison samples drawn from existing survey databases can be weakened by both bias and unreliability.  Some studies, for example, compare program participants with respondents to the Youth Risk Behavior Survey or the National Longitudinal Survey of Adolescent Health.  Such study designs face added complications arising from noncomparability of survey instruments, data collection methods, and timing of the data collection (Santelli et al. 2000).

Carefully designed and implemented experimental design studies can both overcome these weaknesses and offer unanticipated bonuses for programs and policymakers.  When program resources are not sufficient to serve everyone, many youth will not receive the abstinence education program services, regardless of whether there is an experimental-design evaluation or not.  Random assignment is often fairer than commonly used practices such as “first come, first served” or referral systems to allocate scarce program resources.  Random assignment designs also can provide valuable information about the magnitude of “unmet” demand for the program services.  Assuming that the evaluation design is implemented so that programs operate at capacity, the size of the control group provides a lower-bound estimate of unmet demand.  At the same time, the operational experience with outreach and recruitment provides qualitative information regarding how thorough and successful the outreach efforts are and may provide tips on how to strengthen future outreach efforts.

One limitation of a random assignment design for measuring program impacts arises if any of the programs has major spillover effects.  If, for example, youth who are assigned to the program group interact with youth in the control group in ways that transfer the benefits of the program intervention to peers in the control group, the random assignment study design will underestimate program impacts.  Similarly, if the presence of an intervention in the school or community significantly alters the overall school or community climate in important ways, this could lead to underestimates of program impacts.  The overall judgment of the evaluation team is that, for each of the five sites included in the impact evaluation, spillover effects are expected to be very small in relation to the direct effects on those who participate in the program.  Nonetheless, this is an issue that has received ongoing attention by the evaluation team and that is addressed in the follow-up surveys with students.2

The impact evaluation has large sample sizes of between 400 and 700 youth per site.  Large sample sizes protect against the possibility of failing to detect true program impacts simply because the study lacks statistical power.  It is important that, if no statistically significant program-related impacts are detected on sexual activity or on risks of STDs or pregnancy, for example, one of two conditions holds:  (1) there really was no impact of the program at all, or (2) any program impact was sufficiently small as to be of no importance to policymakers or practitioners.

What constitutes a sample size large enough to detect true impacts depends in large part on the nature of the program.  Generally, low-intensity or short programs have smaller impacts and, thus, require larger sample sizes to ensure that true impacts are picked up in the analysis.  The opposite is generally true of programs that are longer or more intensive.

The originally planned one-year period of sample enrollment for the evaluation was extended to three years in order to generate samples large enough to ensure detecting meaningful program effects and to avoid false claims of no effects.  Final sample sizes per site are expected to vary between 443 (280 program/163 control) and 700 (371 program/329 control) students.  Table 4 presents estimates of changes in outcomes the study will be able to detect using reasonable standards of statistical power and precision, given these sample sizes and given national estimates of the prevalence for selected outcomes.  For example, the study will be able to detect true program impacts on the percentage of students who are sexually experienced of 7.2 percentage points or larger in the site with 700 youth in the study sample and of 11.2 percentage points or larger in the site with 443 youth in the sample.

Table 4: Minimum Detectable Changes in Outcomes

Outcome Measure (Wave 3) | Estimated Prevalence of Outcome (a) | Minimum Change Detectable (b): Largest Sample | Minimum Change Detectable (b): Smallest Sample
Taken Virginity Pledge | 14.9% | ±6.0% | ±9.3%
Sexually Experienced | 24.1% | ±7.2% | ±11.2%
Abstinent at Follow-up (c) | 86.5% | ±5.8% | ±8.9%
At Risk of Pregnancy (d) | 17.3% | ±6.4% | ±9.8%
Sample Sizes | | 700 | 443
  • Program Group | | 371 | 280
  • Control Group | | 329 | 163

a.  These estimates are based on computations from the National Longitudinal Survey of Adolescent Health data.  National prevalence estimates for youth at different ages have been weighted by the age distribution of the Title V Section 510 abstinence education program evaluation sample in the construction of these estimates.

b.  Minimum detectable differences are calculated based on the actual sample sizes, adjusted for anticipated nonresponse to follow-up surveys.  A 95 percent confidence interval and an 80 percent power requirement were used.

c.  Defined as never had sexual intercourse or not sexually active in past 90 days.

d.  Defined as sexually experienced and did not use a highly effective method of contraception at last intercourse.
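For readers who want to see how a minimum detectable difference depends on sample size and prevalence, the sketch below applies a common textbook approximation for comparing two proportions with a two-sided 5 percent test and 80 percent power.  It is an assumption-laden illustration, not the report's method: the report does not give its formula, and this version omits the nonresponse adjustment mentioned in note b and any other adjustments the evaluators may have applied, so its results should not be expected to reproduce Table 4.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_difference(prevalence, n_program, n_control,
                                  alpha=0.05, power=0.80):
    """Textbook approximation of the smallest true difference in proportions
    detectable with a two-sided test at level alpha and the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    # Standard error of the difference, using a common prevalence in both groups
    se = sqrt(prevalence * (1 - prevalence) * (1 / n_program + 1 / n_control))
    return (z_alpha + z_power) * se


# "Sexually Experienced" prevalence and group sizes taken from Table 4
mdd_large = minimum_detectable_difference(0.241, n_program=371, n_control=329)
mdd_small = minimum_detectable_difference(0.241, n_program=280, n_control=163)
print(f"largest sample:  {100 * mdd_large:.1f} percentage points")
print(f"smallest sample: {100 * mdd_small:.1f} percentage points")
```

The pattern matches the table qualitatively: the smaller and less balanced sample can detect only larger impacts.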

To guard against errors that might arise based on findings from small sample sizes with low statistical power, no impact evaluation results will be released until data for the full study sample are available.  Results based on just the first one or two years of sample enrollment would run a risk of missing true impacts simply because of small sample sizes.

The study sample is being followed for up to 36 months.  The data collection schedule balances the need to release study findings at the earliest point possible with the importance of ensuring that study findings offer reliable guidance for policy and practice decisions.  Two waves of follow-up surveys are planned.  The wave 2 follow-up survey is being administered 6 to 12 months after initial study enrollment (when the wave 1 baseline survey was administered), and the wave 3 follow-up survey will be administered between 18 and 36 months after enrollment.  The interval between sample enrollment and the wave 3 survey depends on the age of youth at enrollment and the latest calendar date when surveys can be administered given the reporting schedule.  Under this plan, it is possible to analyze both short-term impacts on knowledge, attitudes, and intentions of youth related to abstinence and longer-term impacts on behavior.

Because so few youth engage in sexual activity before entering high school, outcome estimates based on wave 2 outcome data from middle-school years would miss program impacts on behaviors that most often would emerge at later ages.  Indeed, a shortcoming of previous abstinence education evaluations has been a follow-up period that does not extend beyond the middle school years.  Nationally, only 12 percent of males and 8 percent of females under age 13 have ever had sex (tabulations of the National Longitudinal Survey of Adolescent Health).  It is important to have the data collection period extend as long as possible in order to measure behavioral outcomes at ages where the prevalence of the behavior is high enough that changes in behavior will be observed.

The follow-up period for this evaluation is such that almost two-thirds of the study sample will be 14 to 18 years of age by the time of the wave 3 follow-up and no youth will be younger than age 12.  Even with the extended follow-up period, however, only six percent of the study sample will have reached ages 18 and 19, when over half their peers are expected to be sexually active.  To address the potential need for even longer follow-up, the data collection procedures and plans for the evaluation are designed to accommodate a longer follow-up period if resources become available.