The needed sample size also depends on the level of precision at which the impact is to be measured. The precision standard for a sample design is determined by three factors: (1) the desired level of statistical significance for the impact estimate, (2) the power of the sample design (the probability of detecting the desired effect), and (3) whether a one-sided or a two-sided hypothesis test is used. A result is referred to as statistically significant if the probability of the true impact being zero, given the estimated impact and its standard error, is very low--generally 10 percent or less (typical standards are 10 percent, 5 percent, or 1 percent). For a given size impact, the smaller the standard error, the more statistically significant the estimate; larger sample sizes are thus required to detect an effect at the 1 percent level of significance than at the 5 percent level. The power of the design is the probability of detecting an effect, assuming an effect of a given size is present--for example, if the design has 80 percent power to detect a 5 percentage point impact at a 5 percent significance level, then, assuming the true impact of the program is 5 percentage points, the probability that a statistically significant impact will be observed is 80 percent. The larger the sample size, the higher the power of the sample to detect impacts of a given size and significance level.

Most evaluation research uses two-sided hypothesis tests, under the assumption that it is useful to distinguish effects in the desired or the unintended direction from policies with no effect. Bloom (1995) argued that one-sided tests may be adequate for most evaluations, since the key concern is to distinguish whether a policy had the desired effect or not. The advantage of one- sided tests is that smaller sample sizes are needed than in two-sided tests to achieve a given level of power and statistical significance.