External validity: policy demand is there but research needs to boost supply
A randomised controlled trial (RCT) in a northern district of Uganda finds that young adults who receive cash transfers use them to buy more food for their families, football shirts, and airtime for their mobile phones than those in control areas. Would the pattern be the same if young adults in central Uganda were given cash transfers? Would the findings replicate if the cash transfers were given to young women in Senegal? This stylised example points to the crucial question of whether program impacts generalise to other contexts – commonly referred to as external validity. A recent systematic review (Peters et al. 2018) sets out to check how RCT studies in leading economics journals deal with external validity hazards.
Four major hazards to external validity
According to Duflo and colleagues (2007), there are four major external validity hazards in RCTs: specific sample problems, special care, general equilibrium effects, and Hawthorne effects. The specific sample problem occurs if the study population is different from the policy population in which the intervention will be brought to scale. The special care problem refers to the fact that in many RCTs the treatment is provided differently (e.g. by an NGO) than it would be if it were brought to scale (e.g. implemented by the government). General equilibrium effects are in most cases not captured in an RCT, because they become noticeable only if the program is scaled to a broader population or extended over a longer term. This is the case, for example, when prices or norms change as the program reaches a broader population. Hawthorne effects happen if people alter their behaviour in response to being part of a randomised experiment. In 2011, 3ie argued that RCTs, due to their ex-ante and repeat surveying, might be more vulnerable to Hawthorne-induced biases than non-experimental impact evaluations. Likewise, the special care hazard is unique to the controlled character of RCTs, where researchers often make sure the treatment is delivered effectively, and hence does not occur in non-experimental studies. Therefore, it is frequently argued that RCTs, despite their high degree of internal validity, need to worry more about external validity.
How are these major hazards handled in the practice of impact evaluations?
The paper by Peters and colleagues (2018) systematically reviews how published RCTs have addressed and reported on these four major hazards. It includes all 54 RCTs that were conducted in a developing country and published between 2009 and 2014 in leading economics journals: American Economic Review, Quarterly Journal of Economics, Econometrica, Economic Journal, Review of Economic Studies, Review of Economics and Statistics, Journal of Political Economy, and American Economic Journal: Applied Economics. For each paper, a set of simple yes/no questions was answered on how it deals with hazards to external validity. Short reports were also sent to the lead authors, asking them to review the answers. A response was received for 67 per cent of the papers, and these responses confirmed the assessment for 97 per cent of all questions.
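To make the yes/no coding approach concrete, here is a minimal sketch of how such answers could be tallied into the share of papers that address each hazard. It is not the procedure or the data used by Peters and colleagues: the hazard labels, the `share_addressing` function, and the four toy papers are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical labels for the four hazards named in the text.
HAZARDS = ("specific_sample", "special_care", "general_equilibrium", "hawthorne")

# Invented toy coding sheet (NOT data from the review):
# paper id -> {hazard: True if the paper addresses that hazard}
toy_coding = {
    "paper_A": {"specific_sample": True, "special_care": False,
                "general_equilibrium": True, "hawthorne": False},
    "paper_B": {"specific_sample": True, "special_care": False,
                "general_equilibrium": True, "hawthorne": True},
    "paper_C": {"specific_sample": True, "special_care": True,
                "general_equilibrium": False, "hawthorne": False},
    "paper_D": {"specific_sample": False, "special_care": False,
                "general_equilibrium": False, "hawthorne": False},
}


def share_addressing(coding):
    """Return, per hazard, the fraction of papers coded 'yes'."""
    if not coding:
        raise ValueError("coding sheet is empty")
    counts = Counter()
    for answers in coding.values():
        for hazard in HAZARDS:
            # A missing answer is treated as 'no' in this sketch.
            if answers.get(hazard, False):
                counts[hazard] += 1
    n_papers = len(coding)
    return {hazard: counts[hazard] / n_papers for hazard in HAZARDS}


shares = share_addressing(toy_coding)
for hazard, share in shares.items():
    print(f"{hazard}: {share:.0%} of toy papers address this hazard")
```

Treating a missing answer as "no" is a simplification made for this sketch; a real review would likely record unclear cases separately.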
The systematic review finds that many published RCTs do not provide a comprehensive explanation of how the experiment was implemented. Two-thirds of the papers do not even mention whether the participants in the experiment are aware of being randomised. That can sometimes be obvious from the study’s context, but often it is not, and assessing whether Hawthorne effects might be at play is then impossible. One third of papers do not discuss general equilibrium effects. This is especially problematic when the intervention spills over to non-surveyed agents or even to the control group and leads to an incorrect assessment of the treatment effect. A discussion of the special care problem is rare (only 20 per cent of papers), which is particularly concerning in a developing country context, where more than 60 per cent of RCTs are implemented by NGOs or researchers, who are arguably more flexible in terms of treatment provision than the government. By far the most commonly addressed hazard to external validity is the specific sample problem: 77 per cent of papers compare study and potential policy populations or at least discuss other potential obstacles to generalisability.
An earlier 3ie review from 2011 corroborates the Peters et al. results. This review was designed to examine how impact evaluations deal with Hawthorne effects and hence focused on studies that explicitly mention this term. These studies were extracted from various databases, including 3ie’s impact evaluation repository, using 3ie’s definition and inclusion criteria for impact evaluations. The review found only 11 studies that mentioned Hawthorne effects, of which 6 referred to it as a potential source of bias in the results, 5 argued that the design of the experiment minimised the possibility of Hawthorne bias, and 1 used it as an argument for using a matching design (rather than randomisation). No study was found that estimated the Hawthorne effect.
How do we move forward?
3ie is emphasising attention to external validity during all stages of the evaluation. For example, 3ie is increasingly encouraging researchers to assess whether program designers have carried out research to properly diagnose the root causes of the problem that the program intends to address. This helps to avoid comparing the effectiveness of interventions addressing accurately diagnosed problems with those where there has been a mismatch between the diagnosis and treatment, as has been articulated in another 3ie blog.
3ie also stresses the importance of impact evaluations being theory-based, mapping out the causal chain of a development intervention from inputs to outcomes and impact. This helps to increase external validity and to reveal potential threats to it because a theory-based evaluation lays out and tests the underlying assumptions to answer the crucial question of ‘why’ a development programme should have an impact. It also requires a deep understanding of the contextual factors that facilitate or inhibit the pathway from intervention to impact (see Williams 2020; White 2009; Bates and Glennerster 2017). For replication of an intervention in a different context, it is important to assess whether the theoretical assumptions hold and how the intervention needs to be adapted. Besides theory-based impact evaluations, 3ie strongly advocates for systematic synthesis of evidence to inform policy rather than depending on any single impact evaluation and for including a range of quasi-experimental evaluation methods beyond RCTs.
Moreover, all proposals for impact evaluations submitted to 3ie are assessed regarding external validity and generalisability and 3ie’s reporting templates require researchers to discuss and report on the same. Likewise, 3ie’s systematic reviews account for the quality of reviewed impact evaluations by assessing the risk of bias, including from possible Hawthorne effects.
Findings from Peters and colleagues (2018) and a 3ie review of 2011 suggest that, despite a rising awareness of external validity hazards, their discussion and reporting are often neglected in published impact evaluations. As the importance of these evaluations goes beyond the research community and heavily affects decision-making, researchers should discuss external validity issues adequately and journals should make this a requirement. For example, flagship journals and associations like the American Economic Association could make systematic reporting on external validity mandatory as a supplement to submitted papers. This would provide incentives for researchers to address external validity issues as rigorously throughout the study implementation and publication process as is done for other methodological issues.
- Duflo, E., Glennerster, R. and Kremer, M. (2007). Chapter 61 Using Randomization in Development Economics Research: A Toolkit. Handbook of Development Economics, pp.3895-3962.
- Gaarder, M., and Dixon, V. (2018). Misdiagnosis and the evidence trap: a tale of inadequate program design. [Blog] Evidence Matters. Available at: https://www.3ieimpact.org/blogs/misdiagnosis-and-evidence-trap-tale-inadequate-program-design.
- Gaarder, M., Masset, E., Waddington, H., White, H. and Mishra, A. (2011). Invisible treatments: placebo and Hawthorne effects in development programs. [Presentation] Available at: https://www.3ieimpact.org/sites/default/files/2019-12/Placebo-Hawthorne-3ie-presentation.pdf.
- Jimenez, E. (2019). Be careful what you wish for: cautionary tales on using single studies to inform policymaking. [Blog] Evidence Matters. Available at: https://www.3ieimpact.org/blogs/be-careful-what-you-wish-cautionary-tales-using-single-studies-inform-policymaking.
- Mobarak, M., Levy, K., and Reimão, M. (2017). The path to scale: Replication, general equilibrium effects, and new settings. [Blog] VoxDev. Available at: https://voxdev.org/topic/methods-measurement/path-scale-replication-general-equilibrium-effects-and-new-settings.
- Muralidharan, K., and Niehaus, P. (2018). Why studies should be conducted on a larger scale. [Blog] VoxDev. Available at: https://voxdev.org/topic/methods-measurement/why-studies-should-be-conducted-larger-scale.
- Peters, J., Langbein, J. and Roberts, G. (2018). Generalization in the Tropics – Development Policy, Randomized Controlled Trials, and External Validity. The World Bank Research Observer, 33(1), pp.34-64.
- Williams, M.J. (2020). External Validity and Policy Adaptation: From Impact Evaluation to Policy Design. The World Bank Research Observer, forthcoming.
- White, H. (2009). Theory-based impact evaluation: principles and practice. Journal of Development Effectiveness, 1(3), pp.271-284.