Approaches to Evaluating Welfare Reform: Lessons from Five State Demonstrations

b. Precision Standard


The needed sample size also depends on the level of precision at which the impact is to be measured. The precision standard for a sample design is determined by three factors: (1) the desired level of statistical significance for the impact estimate, (2) the power of the sample design (the probability of detecting an effect of the desired size), and (3) whether a one-sided or a two-sided hypothesis test is used.

A result is referred to as statistically significant if an estimate at least as large as the one observed, relative to its standard error, would occur with very low probability were the true impact zero. That probability threshold is generally 10 percent or less (typical standards are 10 percent, 5 percent, or 1 percent). For an impact of a given size, the smaller the standard error, the more statistically significant the estimate. Because the standard error shrinks as the sample grows, larger sample sizes are required to detect an effect at the 1 percent level of significance than at the 5 percent level.

The power of the design is the probability of detecting an effect, assuming an effect of a given size is present. For example, if the design has 80 percent power to detect a 5 percentage point impact at a 5 percent significance level, then, assuming the true impact of the program is 5 percentage points, the probability that a statistically significant impact will be observed is 80 percent. For a given impact size and significance level, the larger the sample, the higher the power of the design.
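The relationship among significance level, power, and sample size can be illustrated with a minimal sketch. It assumes a binary outcome, equal-sized program and control groups, a two-sided test, and the usual normal approximation. The function names and the 50 percent control-group rate are hypothetical choices made for illustration; they are not taken from the evaluations discussed here.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def required_n_per_group(p_control, impact, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided test of a
    difference in proportions (normal approximation, equal group sizes)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = STD_NORMAL.inv_cdf(power)          # quantile matching desired power
    p_treat = p_control + impact
    # Sum of the per-observation outcome variances in the two groups.
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_power) ** 2 * variance / impact ** 2)


def approx_power(n_per_group, p_control, impact, alpha=0.05):
    """Approximate power of the same two-sided test for a given sample size.
    Ignores the negligible chance of rejecting in the wrong direction."""
    p_treat = p_control + impact
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    std_error = sqrt(variance / n_per_group)
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(abs(impact) / std_error - z_alpha)


if __name__ == "__main__":
    # Hypothetical case: 50 percent control-group rate, 5 percentage point impact.
    for alpha in (0.10, 0.05, 0.01):
        n = required_n_per_group(0.50, 0.05, alpha=alpha)
        print(f"alpha={alpha:.2f}: about {n} per group for 80 percent power")
    for n in (500, 1000, 2000):
        print(f"n={n} per group: power is about {approx_power(n, 0.50, 0.05):.2f}")
```

Under these assumptions, tightening the significance level from 5 percent to 1 percent raises the required sample, and for a fixed significance level the power rises with the sample size.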

Most evaluation research uses two-sided hypothesis tests, under the assumption that it is useful to distinguish effects in either the desired or the unintended direction from no effect at all. Bloom (1995) argued that one-sided tests may be adequate for most evaluations, since the key concern is whether or not a policy had the desired effect. The advantage of one-sided tests is that they require smaller sample sizes than two-sided tests to achieve a given level of power and statistical significance.
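The sample-size saving from a one-sided test can be sketched as follows. This is an illustration under stated assumptions, not a calculation from Bloom (1995): it assumes a binary outcome, equal-sized groups, and the normal approximation. The function name, the `sides` parameter, and the 50 percent control-group rate are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(p_control, impact, alpha=0.05, power=0.80, sides=2):
    """Approximate sample size per group to detect a difference in
    proportions with a one-sided (sides=1) or two-sided (sides=2) test."""
    if sides not in (1, 2):
        raise ValueError("sides must be 1 or 2")
    std_normal = NormalDist()
    # A two-sided test splits alpha across both tails, so its critical
    # value is larger than that of a one-sided test at the same alpha.
    z_alpha = std_normal.inv_cdf(1 - alpha / sides)
    z_power = std_normal.inv_cdf(power)
    p_treat = p_control + impact
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_power) ** 2 * variance / impact ** 2)


if __name__ == "__main__":
    # Hypothetical case: 50 percent control-group rate, 5 percentage point
    # impact, 5 percent significance level, 80 percent power.
    two_sided = required_n_per_group(0.50, 0.05, sides=2)
    one_sided = required_n_per_group(0.50, 0.05, sides=1)
    print(f"two-sided: about {two_sided} per group")
    print(f"one-sided: about {one_sided} per group")
    print(f"one-sided needs about {1 - one_sided / two_sided:.0%} fewer cases")
```

With these illustrative inputs, the one-sided design needs roughly one-fifth fewer cases per group than the two-sided design. The cost is that an impact in the unintended direction can no longer be distinguished from no impact.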