Can we do small n impact evaluations?

Howard White 28 August 2012

3ie was set up to fill ‘the evaluation gap’, the lack of evidence about ‘what works in development’. Our founding document stated that 3ie will be issues-led, not methods-led, seeking the best available method to answer the evaluation question at hand. We have remained true to this vision: we have already funded close to 100 studies in over 30 countries. And we strongly promote mixed methods, in which the attribution analysis of ‘what works’ is embedded in a larger evaluation framework combining process and impact evaluation, factual and counterfactual analysis. This helps unpack the causal chain to understand why an intervention works or does not, or works only for certain people in certain places.

Although we promote mixed methods, 3ie only funds studies which have at their core a large n impact evaluation, that is, an experimental or quasi-experimental design with enough units of assignment to attain the statistical power such a design requires. But many development interventions, such as the support of policy reform at the national level, or capacity building in a single agency, or indeed the assessment of whether a particular impact evaluation has influenced policy, are small n questions. That is, there are too few units of assignment to conduct statistical analysis of what difference the intervention has made.
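
To give a rough sense of what ‘sufficient units of assignment’ means, here is a minimal sketch of a textbook power calculation. It is an illustration under stated assumptions, not 3ie’s screening procedure: it assumes a simple two-arm comparison of means with equal-sized arms, assignment at the individual level (no clustering), a standardised effect size, and a normal approximation. The function names `units_per_arm` and `approx_power` are hypothetical, and the defaults of a 5% significance level and 80% power are conventions rather than anything stated above.

```python
from math import ceil, sqrt
from statistics import NormalDist

# Standard normal distribution, used for the normal approximation.
_STD_NORMAL = NormalDist()


def units_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate units needed in EACH arm of a two-arm comparison of means.

    effect_size is the standardised difference (difference in means divided
    by the standard deviation). Assumes equal arms and no clustering.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


def approx_power(effect_size, n_per_arm, alpha=0.05):
    """Approximate power of the same test for a given number of units per arm.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(effect_size * sqrt(n_per_arm / 2) - z_alpha)


if __name__ == "__main__":
    # Units per arm needed for small, medium and large standardised effects.
    for d in (0.2, 0.5, 0.8):
        print(f"effect size {d}: {units_per_arm(d)} units per arm")

    # Power available when only a handful of units can be assigned.
    for n in (1, 5, 10):
        print(f"{n} unit(s) per arm, effect size 0.2: power {approx_power(0.2, n):.2f}")
```

Under these assumptions a small standardised effect of 0.2 calls for roughly 400 units in each arm, whereas a design with a single agency or a handful of countries has power far below the conventional 80%. Clustered designs, which are common in development evaluations, would need more units still.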

So how do we do small n impact evaluation? While there is no shortage of proposed approaches, we don’t have a consensus. 3ie has processed over 700 proposals to conduct large n impact evaluations. The more than 200 external reviewers we have consulted to screen these proposals are largely in agreement as to what they are looking for: a credible identification strategy, sufficient power and so on. But if we were to put ten evaluators in a room to screen proposals for small n studies, it is doubtful they would agree on which were the best designs, or even on what constitutes a good design.

Despite this lack of consensus, I had a strong feeling that there was in fact a common core to the bewildering array of competing methods, such as realist evaluation, general elimination methodology, process tracing, and contribution analysis. Together with Daniel Phillips, also at 3ie, I undertook a journey to the centre of small n methods. And to an extent we found what we were looking for. The four approaches just mentioned all stress the importance of clearly defining the intervention being evaluated and the underlying theory of change. Changes in outcomes of interest should be documented, along with other plausible external explanations for observed changes in these outcomes. These pieces of evidence are assembled to establish plausible association.

All this was well and good, but we were missing something. I kept coming back to the question of “but what constitutes credible causal evidence?” There was no explicit answer to this question. At best we are told “to use mixed methods”, which is not much more informative than saying “we will use methods”. In seeking an answer to this question I got drawn into cognitive psychology, in which the study of attribution, or more precisely people’s ability to correctly ascertain causality, is a separate field of study. And the news is not good. Basically, people are not very good at assessing attribution, with the ‘fundamental attribution error’ being the centrepiece of the literature. The fundamental error is that we more readily identify people, rather than underlying circumstances, as the cause of change, hence creating a bias to overstate the role of interventions rather than underlying social trends. An exception is the self-serving bias: when things go right we take the credit, but when they go wrong other factors are to blame. World Bank assessments of adjustment lending (good macro performance = policies work, bad macro performance = the government didn’t carry out reform properly) are an example of this bias.

The biases go on and on. Our paper ‘Addressing attribution of cause and effect in small n impact evaluations: towards an integrated framework’ has the full list of these biases. But, in summary, potential biases can arise when collecting qualitative data: in deciding which questions are asked, in what order, how they are asked and how the replies of the respondents are recorded. There can also be biases in how the responses are interpreted and analysed, and finally in which results are chosen for presentation. Of course, quantitative data and analysis are also prone to bias, such as sampling bias and selection bias. But methodologies have been developed to deal explicitly with these biases. Indeed, evaluation designs are judged on precisely how well they deal with them.

We need to be issues-led, not methods-led. But we need stronger methods, and agreement on those methods, in order to be able to judge small n interventions with confidence.

Read the 3ie Working Paper ‘Addressing attribution of cause and effect in small n impact evaluations: towards an integrated framework’.