Towards a Credibility Revolution: Why successful replication remains unlikely

Updated: Nov 13, 2018

by T.D. Stanley and Chris Doucouliagos

Power Failure

Recent meta-science studies find that typical statistical power in psychology is about 4 times that of medical research (9%) and twice that of economics (18%), which puts psychology’s median power at roughly 36%.[1, 2, 3] Yet, only 8% of psychological studies are adequately powered. Statistical power is the probability that a study of a given precision (or sample size) will find a statistically significant effect when a true effect of a given size exists. Following Cohen, adequate power (80%) has been deemed a prerequisite of reliable research (see, for example, the APA Publication Manual). With statistical power so low, how is it possible that the majority of published findings are statistically significant?[4] Something does not add up.
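To make the definition of power concrete, here is a minimal sketch using a two-sided z-test and the normal approximation. The function names and the example effect size are illustrative assumptions, not taken from the studies cited above.

```python
from statistics import NormalDist

def power_two_sided(effect, se, alpha=0.05):
    """Approximate power of a two-sided z-test for a true effect
    estimated with standard error `se` (normal approximation)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    shift = effect / se               # true effect in standard-error units
    # Probability of landing in either rejection region.
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)

def se_for_power(effect, target=0.80, alpha=0.05):
    """Largest standard error that still reaches the target power,
    ignoring the negligible chance of rejecting in the wrong tail."""
    z = NormalDist()
    return abs(effect) / (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(target))

# Hypothetical example: a true standardized effect of 0.5.
needed_se = se_for_power(0.5)
print(round(needed_se, 3), round(power_two_sided(0.5, needed_se), 3))
```

Under these assumptions, 80% power requires the standard error to be no larger than the true effect divided by about 2.8 (1.96 + 0.84). When the true effect is zero, the same function returns the significance level, alpha.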

The Incredible Shrinking Effect

When 100 highly regarded psychological experiments were replicated by the Open Science Collaboration, the average effect size shrank by half.[5] Effect sizes again shrank by about half when 21 experiments published in Nature and Science were replicated.[6] Size matters. In economics, a simple weighted average of adequately powered results is typically one-half the size of the average reported economic effect, and one-third of all estimates are exaggerated by a factor of 4.[3] However, low power and research inflation are the least of the social sciences’ replication problems.

On the Unreliability of Science - Heterogeneity

“What meta-analyses reveal about the replicability of psychological research” [1] demonstrates that high heterogeneity is the more stubborn barrier to successful replication in psychology. Heterogeneity is the variation among ‘true’ effects; in other words, it measures the differences in experimental results not attributable to sampling error. Even if a replication study were huge, involving millions of experimental subjects and thereby having 100% power, typical heterogeneity (74%) makes close replication unlikely. Even then, the probability that the replicated experiment will roughly reproduce some previous study’s effect (say, one between .2 and .5 standardized mean differences) is still less than 50%.[1] Supporters of the status quo are likely to point out that the high heterogeneity that this survey uncovers includes ‘conceptual’ as well as ‘direct’ replication. True enough, but large-scale replication efforts that closely control experimental and methodological factors (e.g. the Registered Replication Reports and the Many Labs projects) still report sufficient heterogeneity to make close replication unlikely.[1,7]
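The arithmetic behind this claim can be sketched as follows. This is an illustration under stated assumptions, not the calculation reported in [1]: it treats the 74% figure as an I²-type share of observed variation, assumes a hypothetical typical standard error of 0.2, assumes true effects are normally distributed, and centers them at 0.35 (the middle of the .2 to .5 range, which is the most favorable case).

```python
from math import sqrt
from statistics import NormalDist

def tau_from_i2(i2, typical_se):
    """Heterogeneity standard deviation implied by an I²-type share:
    i2 = tau^2 / (tau^2 + typical_se^2)."""
    return typical_se * sqrt(i2 / (1 - i2))

def prob_true_effect_in_range(mean, tau, low, high):
    """Probability that a replication with no sampling error (100% power)
    finds a true effect between `low` and `high`."""
    dist = NormalDist(mean, tau)
    return dist.cdf(high) - dist.cdf(low)

# All numbers below are assumptions chosen for illustration.
tau = tau_from_i2(0.74, typical_se=0.2)
p = prob_true_effect_in_range(mean=0.35, tau=tau, low=0.2, high=0.5)
print(round(tau, 3), round(p, 3))
```

With these inputs the probability falls well below one half, even though sampling error has been removed entirely. Centering the distribution anywhere else in the range, or assuming a larger typical standard error, lowers it further.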

This is not to argue that large-scale, careful replications should not be undertaken. Indeed, they should! They are often the best scientific evidence available to the social and medical sciences. Unfortunately, such large-scale multi-lab replication projects are feasible for only a relatively small number of research areas, those where studies can be conducted cheaply and quickly.

Enter Meta-Analysis

For some decades, meta-analyses that collect and analyze all relevant research evidence were seen as the best summaries of research evidence and the very foundation of evidence-based practice (think of the Cochrane and Campbell Collaborations). As reported in a recent Science article, meta-analysis has also been dragged into the credibility crisis and can no longer be relied upon to settle all disputes. After all, that’s a pretty high bar! Unfortunately, conventional meta-analysis is easily overwhelmed by high heterogeneity when it is accompanied by some degree of selective reporting for statistical significance. Even when the investigated social science phenomenon does not truly exist, conventional meta-analysis is virtually guaranteed to report a false positive under these conditions.[8] And no single publication bias correction method is entirely satisfactory.[8,9]
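A small simulation illustrates how heterogeneity and selective reporting combine to produce a false positive. It is a simplified sketch rather than the simulation design used in [8]: every study has the same standard error, the average true effect is exactly zero, significantly positive results are always published, and other results are published with a hypothetical probability of 20%.

```python
import random
from statistics import mean, stdev

def simulate_published_mean(n_studies=2000, tau=0.3, se=0.2,
                            p_publish_nonsig=0.2, seed=1):
    """Return (average published estimate, its z-statistic, number published)
    when the mean true effect is zero but reporting is selective."""
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        true_effect = rng.gauss(0.0, tau)        # heterogeneous true effects, mean zero
        estimate = rng.gauss(true_effect, se)    # add sampling error
        significant_positive = estimate / se > 1.96
        # Selective reporting: significant positive results always appear.
        if significant_positive or rng.random() < p_publish_nonsig:
            published.append(estimate)
    avg = mean(published)
    z = avg / (stdev(published) / len(published) ** 0.5)
    return avg, z, len(published)

avg, z, k = simulate_published_mean()
print(f"published studies: {k}, average effect: {avg:.3f}, z: {z:.1f}")
```

Setting `p_publish_nonsig=1.0` switches selection off, and the average returns to approximately zero. The false positive comes from the selection acting on widely dispersed estimates, not from any real effect.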

The Way Forward

With crisis comes opportunity.  In a recent authoritative survey of the credibility of economics research, Christensen and Miguel (2018) emphasize transparency and replication as the way forward.[10]  We believe that the current discussion of ‘crisis’ can be transformed into a credibility revolution if a consensus can be formed about taking a few feasible steps that strengthen and clarify our research practices. Such steps might include:

1. Carefully distinguishing exploratory from confirmatory research. Both types of investigation are quite valuable. The central problem of the decades-long statistical significance controversy is that exploratory research is presented in terms of statistical hypothesis testing as if it were confirmatory. Yet, early research that identifies where, how, and under which conditions some new phenomenon is expressed is essential. If only it could be presented and published for what it is, without the pretense of hypothesis testing. After some years of exploration, a meta-analysis could be used to assess whether the phenomenon in question merits further confirmatory study. If so, a confirmatory research stage should be undertaken in which adequately powered, pre-registered studies that employ classical hypothesis testing are not only highly valued but expected. During this confirmatory research stage, transparency is essential.

2. Supporting large-scale, pre-registered replications of mature areas of research. Large-scale, pre-registered replications are especially valuable during the confirmatory stage of social science research. Thankfully, these efforts have already begun but need greater funding and more visibility in our best scholarly journals.  

3. Emphasizing practical significance over statistical significance. Many of the debates across the social sciences would disappear if researchers agreed upon how large an effect needs to be in order to be worthy of scientific or practical notice, that is, to have ‘practical significance.’ The problem is that the combination of high heterogeneity and some selective reporting of statistically significant findings (because the current paradigm values them) makes it impossible for social science research, no matter how rigorous and well-conducted, to distinguish a very small effect from nothing. Identifying ‘very small’ effects reliably is simply beyond social science. However, meta-analysis can often reliably distinguish a ‘practically significant’ effect (say, 0.1 Cohen's d or 0.1 elasticity) from a zero effect even under the severe challenges of high heterogeneity and notable selective reporting bias.

With a few modest, but real, changes, genuine scientific progress can be made. 


1. Stanley, T. D., Carter, E. and Doucouliagos, H. (2018). What meta-analyses reveal about the replicability of psychological research. Psychological Bulletin.
2. Lamberink et al. (2018). Statistical power of clinical trials increased while effect size remained stable: an empirical analysis of 136,212 clinical trials between 1975 and 2014. Journal of Clinical Epidemiology, 102: 123–128.
3. Ioannidis, J. P. A., Stanley, T. D. and Doucouliagos, C(H). (2017). The power of bias in economics research. The Economic Journal, 127: F236–265. doi:10.1111/ecoj.12461
4. Brodeur, A., Le, M., Sangnier, M. and Zylberberg, Y. (2016). Star Wars: The empirics strike back. American Economic Journal: Applied Economics, 8: 1–32.
5. Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. doi:10.1126/science.aac4716
6. Camerer et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour.
7. McShane et al. (2018). Large scale replication projects in contemporary psychological research. The American Statistician, forthcoming.
8. Stanley, T. D. (2017). Limitations of PET-PEESE and other meta-analysis methods. Social Psychological and Personality Science, 8: 581–591.
9. McShane, B. B., Böckenholt, U. and Hansen, K. T. (2016). Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science, 11: 730–749.
10. Christensen, G. and Miguel, E. (2018). Transparency, reproducibility, and the credibility of economics research. Journal of Economic Literature, 56: 920–80.
