Meta-Analysis and Research Credibility
Welcome back to MAER-Net’s blog and discussion. I invite researchers interested in meta-analysis and the credibility of the social and medical sciences to use this space for a discussion of ideas and relevant research. It is my intention to intermittently post here about: new methods that we have developed, what these methods may tell us about the credibility of research, and how research practice might change to increase the scientific value of research. I invite you to comment and to post your research about these issues. See “MAER-Net Blog – Introduction,” above, to learn how to register, get updates, and join the discussion.
Past as Prologue
MAER-Net has had several productive web discussions. MAER-Net’s Reporting Guidelines were written after helpful and extensive web discussions. [1,2] John Ioannidis’ 2014 keynote at MAER-Net’s Athens Colloquium triggered another. Most will be aware of John’s famous and provocative “Why most published research findings are false.” [3] His 2014 keynote pivoted from identifying what is wrong with the status quo to suggesting what can be done to increase the credibility of our sciences. [4]
John’s keynote and our web discussion about it led Chris Doucouliagos and me to refocus our research on assessing empirically the key dimensions of research credibility and on identifying what can be done to improve the scientific contribution of social science research. In John’s seminal “why most” paper, he correctly identified statistical power and the proportion of research that is biased (i.e., results that have somehow been selected or rigged to be statistically significant) as key dimensions of credibility. Over the last few years, we have developed several new methods and metrics to address these dimensions and have learned a lot from their wide application across multiple disciplines. For now, I merely wish to sketch briefly some of our broad meta-research findings about these disciplines. In future blogs, I will focus more sharply on specific findings and their implications. If there is interest, I can devote other posts to outlining recently developed methods.
Along with others, we have found that statistical power is quite low in every discipline investigated. Typically, it is much lower than the 80% level widely recognized as necessary for reliable research. [5]
Economics: From more than 64,000 estimates across 159 areas of research, we found that the median of the median of median powers is 11.5%. [6] After doubling this large collection of meta-analyses, we find that typical power is even smaller, especially at top economics journals. However, some specific areas of research are well powered, and experimental economics, in general, has much higher power than observational research.
Political science: From more than 16,000 estimates, we find median power to be 10%. [7]
Medical research: Lamberink et al. (2018) found typical power to be about the same as in economics and political science: 9%, from 136,212 clinical trials. [8]
Psychology: From more than 12,000 studies, the median of the median of median powers is 36%. [9] However, even here, only 8% of studies have power > 80%.
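For readers unfamiliar with how such retrospective power figures are obtained: given an assumed true effect (for example, a meta-analytic weighted average) and a study’s standard error, the study’s power for a conventional two-sided 5% test follows directly from the normal distribution. A minimal sketch (the function name and the numbers below are illustrative, not taken from any of the studies cited):

```python
from statistics import NormalDist

def power(true_effect, se, alpha=0.05):
    """Power of a two-sided z-test to detect `true_effect`
    when the estimate's standard error is `se`."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z = abs(true_effect) / se            # signal-to-noise ratio
    # P(reject H0) = P(Z > z_crit - z) + P(Z < -z_crit - z)
    return (1 - nd.cdf(z_crit - z)) + nd.cdf(-z_crit - z)

# A study whose standard error equals the true effect has only ~17% power:
print(round(power(0.2, 0.2), 2))  # → 0.17
```

Note that a study needs a standard error of roughly the true effect divided by 2.8 to reach the conventional 80% threshold; most reported estimates fall far short of that precision.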
Despite notable exceptions, there is sufficient evidence across these disciplines to conclude that typical research results are not reliable. We can now answer the question posed by MAER-Net’s 2015 discussion, “Is economics research more likely wrong than true?”, in the affirmative, although documenting how we know that empirical economics research is more often wrong than true would require further posts.
Regardless, power as low as typically seen is unacceptable. We must insist that statistical power of reported research be raised. Suggestions:
Power analysis should be mandatory for the publication of empirical research. Most journals now mandate data sharing; realistic and replicable power analyses should also be required. Such prospective power analysis should be based on the estimated mean from a meta-analysis of studies that investigate the same phenomenon or effect. Exceptions can be made for pilot and exploratory studies, but these need to be explicitly identified as such.
Abandon testing against a null of zero. Everything is something, and “nothing” (an effect of exactly zero) is scientifically uninteresting. Instead, researchers need to define the smallest effect of scientific (or policy) interest (SESI) and use this effect size as their null hypothesis. Having established the SESI, prospective power analysis can be conducted against it.
Meta-analysts should code and separate exploratory and pilot studies. These studies are likely responsible for much of the publication selection bias and low power; thus, they should be excluded or investigated separately in meta-analysis.
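The first two suggestions combine naturally: assume the true effect equals the meta-analytic mean, take the SESI as the null, and ask how precise a new study must be to reach 80% power. A minimal sketch under those assumptions (the function name and the illustrative numbers are mine, not from any cited study):

```python
from statistics import NormalDist

def required_se(meta_mean, sesi, alpha=0.05, target_power=0.80):
    """Largest standard error at which a two-sided test of
    H0: effect = SESI attains the target power, assuming the
    true effect equals the meta-analytic mean."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)   # 1.96
    z_pow = nd.inv_cdf(target_power)     # 0.84
    return abs(meta_mean - sesi) / (z_crit + z_pow)

# Illustrative numbers: meta-analytic mean of 0.30, SESI of 0.10.
# A design whose expected standard error exceeds this is underpowered.
print(round(required_se(0.30, 0.10), 3))  # → 0.071
```

From the required standard error, the researcher can back out the sample size in the usual way for their estimator, since standard errors typically shrink with the square root of n.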
The other dimension most responsible for research credibility (or its absence) is the proportion of research that has undergone a process of selection for statistical significance (SSS). Readers of this blog will be aware of the issues of publication bias and related behaviors (p-hacking, selective reporting, QRPs, etc.). Recent years have seen a renaissance of methods to identify and reduce these biases, and meta-research has told us much about their likely influence on research, which is to exaggerate reported effect sizes (i.e., ‘research inflation’) and their statistical significance (false positives). Here too, we have learned a lot about just how exaggerated typical research is likely to be.
Economics: Average reported empirical estimates are exaggerated by at least a factor of two relative to a simple weighted average of the adequately powered estimates (WAAP), implying that research inflation (RI) is at least 100%. [6] WAAP is itself known to be exaggerated in the presence of SSS, but less so than the simple unadjusted mean. Bartoš et al. (2022) corroborate this exaggeration factor of two (2.16) using a larger dataset and Bayesian model averaging. [10]
Environmental sciences: RI is estimated to be 78%. 
Medical research: Typical RI is 62%. 
Psychology: Typical RI is 39%. 
Such notable exaggeration can seriously distort scientific knowledge, especially in economics, where half of the purported evidence for an effect is likely due to selection for statistical significance. Psychological research is less exaggerated because psychological effects are typically larger than those in medicine and economics, which also explains psychology’s higher median power.
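To make the WAAP and RI calculations above concrete, here is a simplified sketch. The cutoff SE < |mean|/2.8 marks the precision at which a study has roughly 80% power to detect the overall weighted mean at the 5% level; the data below are invented for illustration, and real applications involve further refinements:

```python
def wls_mean(estimates, ses):
    """Inverse-variance weighted average."""
    weights = [1 / se**2 for se in ses]
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)

def waap(estimates, ses):
    """Weighted average of the adequately powered (WAAP): keep estimates
    with >= 80% power to detect the overall weighted mean (i.e., standard
    error below |mean| / 2.8), then re-average them. Falls back to the
    overall weighted mean if no study is adequately powered."""
    overall = wls_mean(estimates, ses)
    cutoff = abs(overall) / 2.8
    powered = [(e, s) for e, s in zip(estimates, ses) if s < cutoff]
    if not powered:
        return overall
    return wls_mean([e for e, _ in powered], [s for _, s in powered])

def research_inflation(estimates, ses):
    """RI (%): how much the simple average exceeds WAAP."""
    simple = sum(estimates) / len(estimates)
    return 100 * (simple / waap(estimates, ses) - 1)

# Illustrative data: two precise estimates near 0.2, three noisy,
# inflated ones that appear to have been selected for significance.
est = [0.20, 0.22, 0.55, 0.60, 0.48]
ses = [0.02, 0.03, 0.25, 0.30, 0.20]
print(round(research_inflation(est, ses)))  # → 99, i.e., RI near 100%
```

The intuition: imprecise estimates can only achieve statistical significance by being large, so averaging over only the adequately powered estimates strips away much of the selection-driven inflation.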
What can be done?
Mandatory data sharing has already been widely implemented. Using a difference-in-differences design, our large database of economics meta-analyses shows that both statistical significance and excess statistical significance (a metric of SSS) were notably reduced by mandatory data sharing. [11]
Reproducibility and robustness editors should be assigned at scientific journals. With mandatory data and code sharing, it would be easy to reproduce the findings of research submitted for peer review. The editor assigned to do this could also run sensible robustness checks on the reported analyses, making sure that the selected results were not cherry-picked. Incentives could be created to give these editors substantial academic credit and status; perhaps their robustness reports could be routinely published alongside papers.
Increased transparency and preregistration. Much has been discussed about issues of registration and transparency in recent years. The Berkeley Initiative for Transparency in the Social Sciences (BITSS) serves as a clearinghouse for these efforts. Needless to say, MAER-Net supports these efforts and regards them as complements to meta-analysis and meta-research.
Although we still have a long way to go, real progress continues to be made.
1. Stanley, T.D. et al. (2013). Meta-analysis of economics research reporting guidelines. Journal of Economic Surveys, 27: 390-394. https://onlinelibrary.wiley.com/doi/pdf/10.1111/joes.12008
2. Havránek, T. et al. (2020). Reporting guidelines for meta-analysis in economics. Journal of Economic Surveys, 34: 469-475. https://onlinelibrary.wiley.com/doi/full/10.1111/joes.12363
3. Ioannidis, J.P.A. (2005). Why most published research findings are false. PLoS Medicine, 2(8): e124. https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124
4. Ioannidis, J.P.A. (2014). How to make more published research true. PLoS Medicine, 11(10): e1001747. https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001747
5. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
6. Ioannidis, J.P.A., Stanley, T.D. and Doucouliagos, H. (2017). The power of bias in economics research. The Economic Journal, 127: F236-F265.
7. Arel-Bundock, V. et al. (2022). Quantitative political science research is greatly underpowered. https://www.econstor.eu/handle/10419/265531
8. Lamberink et al. (2018). Statistical power of clinical trials increased while effect size remained stable: an empirical analysis of 136,212 clinical trials between 1975 and 2014. Journal of Clinical Epidemiology, 102: 123-128.
9. Stanley, T.D., Carter, E.C. and Doucouliagos, H. (2018). What meta-analyses reveal about the replicability of psychological research. Psychological Bulletin, 144: 1325-1346.
10. Bartoš, F. et al. (2022). Footprint of publication selection bias on meta-analyses in medicine, economics, psychology and environmental sciences. https://arxiv.org/abs/2208.12334
11. Askarov, Z., Doucouliagos, A., Doucouliagos, H. and Stanley, T.D. (2022). The significance of data-sharing policy. Journal of the European Economic Association: 1-36. https://doi.org/10.1093/jeea/jvac053