Requiring fuel gauges: A pitch for justifying impact evaluation sample size assumptions
We expect researchers to defend their assumptions when they write papers or present at seminars. Well, we expect them to defend most of their assumptions. The assumptions behind their sample sizes, which come from their power calculations, are rarely discussed. Yet sample sizes and power calculations matter. Power calculations determine the minimum sample size a study requires, which must then be reconciled with budget constraints. If the sample size is too small, evaluators increase the risk of drawing mistaken conclusions about the effectiveness of interventions (see here for implications of carrying out underpowered evaluations).
Power calculations can be performed in a few different ways. After reviewing numerous 3ie grant requests, we’ve learned some lessons about the key power calculation parameters, namely the minimum detectable effect and the intracluster correlation coefficient (ICC).
The choice of the minimum detectable effect is a critical part of the power analysis. The smaller the study’s minimum detectable effect, the larger the sample size required to ensure sufficient power. When the variables being studied have intrinsic meaning (income, production per hectare, etc.), as in many cases in development economics, the minimum detectable effect should simply be the expected raw difference between the population mean of the experimental group and the population mean of the control group.
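To make the link between the minimum detectable effect and sample size concrete, here is a minimal sketch in Python. It assumes a simple two-arm, individually randomized design with equal arm sizes, a common standard deviation and a normal approximation. The function name `n_per_arm` and the example values are ours, chosen purely for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(mde, sd, alpha=0.05, power=0.80):
    """Approximate observations needed per arm to detect a raw difference in means.

    Assumes two equal-sized arms, a common standard deviation `sd`,
    a two-sided test at level `alpha` and a normal approximation.
    `mde` and `sd` must be in the same units as the outcome.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / mde) ** 2)


# Illustrative values only: an outcome with a standard deviation of 1 unit.
for mde in (0.5, 0.25):
    print(f"MDE = {mde}: {n_per_arm(mde, sd=1.0)} observations per arm")
```

Under these assumptions, halving the minimum detectable effect roughly quadruples the required sample, which is why the choice needs to be defended rather than assumed.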
Unfortunately, many economists are not following this approach. In many proposals we receive, the minimum detectable effect is standardized or expressed as the minimum detectable effect size, with effect size being the difference in the population means of the two groups divided by the standard deviation of the outcome of interest. As a result, the minimum detectable effect size is reported in units of standard deviations. This standardization trend should change. Economists should base their interventions both on the ability to detect the relevant minimum level of impact and also on the cost effectiveness of the intervention.
Many proposals compound their minimum detectable effect size problems by borrowing Cohen’s classification system, where effect sizes of 0.20 are small, 0.50 are medium, and 0.80 are large. Little justification exists for applying this framework to economic research (see here for discussion of some methodological limitations of Cohen’s classification). It’s unclear both how Cohen’s classification became a rule of thumb for minimum detectable effect sizes and how it’s relevant to economics (or education, examples here and here).
Economists are using standardized effect sizes and Cohen’s classification system as a benchmark regardless. In a recent proposal that aimed to evaluate the impact of payments for environmental services, the authors powered their study to detect a minimum effect of 0.25 standard deviations in the number of hectares of natural forest conserved (the outcome variable of interest). This minimum detectable effect size corresponds to 0.35 hectares. Given a baseline mean of 1.15 hectares of natural forest, that corresponds to a 30 per cent increase in natural forest. Although this increase seems quite substantial to us, a minimum detectable effect size of 0.25 standard deviations is considered small according to Cohen’s classification. The minimum detectable effect is more than just a number. Researchers should justify their assumptions by explaining how this minimum detectable effect is relevant for both the cost of the intervention and the impact on treatment populations.
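The arithmetic behind this example can be reproduced in a few lines of Python. This is only a sketch: the function name `describe_mde` is ours, and the standard deviation of 1.4 hectares is not stated in the proposal but is implied by 0.25 standard deviations equalling 0.35 hectares.

```python
def describe_mde(mdes_in_sd, sd, baseline_mean):
    """Convert a standardized effect size into raw units and a percentage change."""
    raw_mde = mdes_in_sd * sd                    # same units as the outcome
    pct_change = 100 * raw_mde / baseline_mean   # relative to the baseline mean
    return raw_mde, pct_change


# 0.25 SD and the 1.15 ha baseline mean come from the proposal discussed above;
# the 1.4 ha standard deviation is implied (0.35 ha / 0.25), not reported.
raw, pct = describe_mde(mdes_in_sd=0.25, sd=1.4, baseline_mean=1.15)
print(f"Raw MDE: {raw:.2f} ha, or {pct:.1f} per cent of the baseline mean")
```

Reporting the raw and percentage figures next to the standardized one lets readers judge for themselves whether an effect labelled "small" really is small.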
Economists should look at how epidemiologists calculate minimum detectable effect sizes. Economists typically provide neither the mean nor the variance of the outcome indicator in their publications. Public health researchers provide both. We attribute this reporting difference to economists using Optimal Design, which does not require the mean and variance, to compute power calculations. Public health researchers conduct power calculations using formulas from Hayes and Bennett, which require the mean and variance. Without these values, it is impossible to present the minimum detectable effect as a percentage change. Presenting the minimum detectable effect as a percentage change allows researchers to judge the detectable magnitude of change due to the intervention. We therefore reiterate McKenzie’s call for including the assumed mean and variance of each outcome indicator when reporting power calculations.
The ICC, which measures how similar individuals are within a cluster, is the other major power calculation assumption for cluster RCTs (here’s a related blog). As research moves from examining individuals to examining clusters of individuals, the similarity of those clusters must be accounted for when determining sample size requirements. It is important to accurately account for ICCs as the greater the similarity within a cluster, the greater the number of observations that are needed to adequately power the impact evaluation.
Like the minimum detectable effect, ICC assumptions determine required sample sizes. These assumptions are also typically unjustified. We receive many proposals with ICCs seemingly pulled from thin air. One recent application based its ICC on a study conducted in a different country, which had a completely different socio-economic background. Another relied on research conducted over a decade ago, with no argument for its current validity. Methods to improve ICC estimates are evolving (repeated measurements appear to increase ICC accuracy). Ideally, researchers should use pilot surveying to calculate actual ICCs. If piloting the intervention is impossible, an alternative is to test multiple, realistic ICCs to determine the study’s power sensitivity and better understand sample size requirements.
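A sensitivity check of this kind can be sketched in Python as follows. The sketch assumes equal cluster sizes, equal numbers of clusters per arm, the standard design effect 1 + (m - 1) × ICC and a normal approximation that ignores the degrees-of-freedom correction for a small number of clusters. The function names and the example design (50 clusters of 20 per arm) are hypothetical, not drawn from any proposal.

```python
from math import sqrt
from statistics import NormalDist


def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


def clustered_mde(clusters_per_arm, cluster_size, sd, icc, alpha=0.05, power=0.80):
    """Approximate minimum detectable difference in means for a two-arm cluster RCT.

    Normal approximation; with few clusters a t-based correction would
    give a somewhat larger value.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    n_per_arm = clusters_per_arm * cluster_size
    inflation = design_effect(cluster_size, icc)
    return multiplier * sd * sqrt(2 * inflation / n_per_arm)


# Hypothetical design: 50 clusters of 20 per arm, outcome SD set to 1
# so that the result reads in standard deviation units.
for icc in (0.01, 0.05, 0.10, 0.20):
    mde = clustered_mde(clusters_per_arm=50, cluster_size=20, sd=1.0, icc=icc)
    print(f"ICC = {icc:.2f}: design effect = {design_effect(20, icc):.2f}, MDE = {mde:.3f} SD")
```

Tabulating the minimum detectable effect across a plausible range of ICCs shows how much the study's conclusions depend on an assumption that is often borrowed rather than measured.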
Online power calculation appendices are one way to increase transparency in assumptions. As the number of studies reporting null results increases, appendices that include minimum detectable effect and ICC assumptions would allow researchers to assess whether null findings are due to a lack of power. Standardizing power calculation ‘fuel gauge’ reporting, through comprehensible minimum detectable effects and justified intracluster correlation coefficients, would improve the accuracy of social science impact evaluation research.