Updated: Jun 25, 2020
Large-Scale Replications show that PET-PEESE succeeds in reducing bias
(A) crucial experiment (is) one which is designed to bring about a decision between two competing theories by refuting (at least) one of them (K. Popper, The Logic of Scientific Discovery, p. 277)
Publication bias (aka selective reporting, specification searching and questionable research practices, QRP) has long been known to pose serious threats to scientific knowledge. It is the likely cause of the current replication crisis (Open Science Collaboration, Many Labs2) and the related loss of confidence in the social sciences.
Publication bias is the cause of social science’s current crisis of confidence
We conservatively estimate that publication bias exaggerates economic findings by at least a factor of 2, making the research inflation rate 100% or more (Ioannidis, Stanley, and Doucouliagos, 2017). Economics research is likely to be notably more exaggerated than this, because we calculate research inflation relative to the weighted average of the adequately powered (WAAP), which is known to be consistently biased upward when there is selective reporting for statistical significance (Stanley, Doucouliagos, and Ioannidis, 2017; Stanley and Doucouliagos, 2019). Needless to say, policy will often be disappointingly ineffective when based upon highly inflated estimates of impact (Doucouliagos, Paldam and Stanley, 2018). Unfortunately, conventional meta-analysis is little help. With selective reporting, it too will be quite biased and very likely to confirm effects that do not exist (aka false positives) (Stanley and Doucouliagos, 2014; 2015; 2019; Stanley, 2017; Stanley, Doucouliagos and Ioannidis, 2017).
Fortunately, there are methods that accommodate publication bias and lessen its pernicious effects on the scientific research record. One such method is PET-PEESE, which has been shown in dozens of simulations and hundreds of applications to reduce research exaggeration when present in the research record (Stanley, 2005; Stanley, 2008; Stanley and Doucouliagos, 2014; Ioannidis et al., 2017). But there are others. Of special note and recent prominence is the family of maximum likelihood selection models, notably the three-parameter selection model (3PSM) of Hedges and Vevea (1996) and its more recent variant published by Isaiah Andrews and Max Kasy (2019) in the AER. However, all methods of correcting publication bias after the fact have their limitations. A recent simulation study of alternative publication bias methods concludes:
“Our results clearly indicated that no single meta-analytic method consistently outperformed all others” (Carter et al., 2018).
All methods are subject to real limitations (Stanley, 2017; Stanley and Doucouliagos, 2019). The replication crisis and famous cases of fraud and data distortion have led to a widespread outcry for preregistration, with which I agree.
Preregistration is not enough
But preregistration, alone, cannot solve the problems of selective reporting. Preregistration and registries of clinical trials have long been required in medical research. Yet, in spite of these efforts, those findings that do get published are often selected to be larger and more significant through changing how outcomes are measured, focusing on subpopulations, or altering the way in which attrition is handled (Turner et al., 2008; Redmond et al., 2013). PET-PEESE has been shown to offer an accurate correction of published medical trials selected from a mandatory registry of clinical trials (FDA) and manipulated to give the impression that antidepressants are more effective than they actually are (Moreno et al., 2009).
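For readers who have not seen the mechanics, the conditional PET-PEESE estimator can be sketched roughly as follows. PET regresses reported effects on their standard errors, PEESE regresses them on the squared standard errors, both with inverse-variance weights, and the intercept serves as the corrected estimate; PEESE is used only when PET's intercept is statistically significant. This is a minimal illustration and not the authors' own code: the function names, the normal critical value of 1.645 for a one-tailed test, and the assumption that the hypothesized effect is positive are all choices made here for the sake of the example.

```python
import math


def weighted_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, se_a)."""
    n = len(y)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    # Weighted residual variance on n - 2 degrees of freedom.
    s2 = sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y)) / (n - 2)
    se_a = math.sqrt(s2 * (1.0 / sw + xbar ** 2 / sxx))
    return a, se_a


def pet_peese(effects, ses, t_crit=1.645):
    """Conditional PET-PEESE; returns (method, estimate, standard error).

    t_crit is an assumed one-tailed normal critical value, and the
    hypothesized effect is assumed to be positive.
    """
    w = [1.0 / s ** 2 for s in ses]                  # inverse-variance weights
    pet, pet_se = weighted_line(ses, effects, w)     # effect = b0 + b1 * SE
    if pet / pet_se > t_crit:
        # PET finds a nonzero effect, so switch to the PEESE intercept.
        peese, peese_se = weighted_line([s ** 2 for s in ses], effects, w)
        return "PEESE", peese, peese_se
    return "PET", pet, pet_se


# Hypothetical effect sizes and standard errors, invented for illustration.
ses = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
effects = [0.51, 0.49, 0.51, 0.49, 0.51, 0.49]
method, estimate, se = pet_peese(effects, ses)
```

The sketch needs at least three studies and some scatter around the fitted line; a perfect fit would give a zero standard error and break the t-ratio.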
Meta-analysis will be with us for a long time to come; thus, it is important to understand how alternative meta-methods actually work in practice, how to identify their weaknesses and how to adjust for them. Simulations can tell us much about the behavior of meta-methods when designed to mirror what we know about the relevant research record. But they rarely do so. When designed in a selective way, they tell us only what we wish to hear.
Simulations are not enough; they are too easily designed to find what they seek.
In particular, simulations of selection models have always baked in 3PSM’s assumptions: that there are sizable steps in the probability that a result is published at a priori known p-values (e.g., .025), that these probabilities are constant, and that these steps are very large, 75% and larger (see Carter et al., 2018, Figure 2). However, we know that there are no such large steps in the frequency of reported results at the celebrated ‘stars.’ See Stephan Bruns’ graph of 64,000 economic effects and their t-values and note the very small jump of frequencies at the conventional .05 significance level.
As a result, selection models are widely known to have difficulties identifying their selection parameters, and their developers and advocates suggest that they be used only for sensitivity analysis (Marks-Anglin & Chen, 2020). While maximum likelihood methods are known to have desirable properties in large samples, they are also widely known to be unreliable in small, realistic samples. This is especially true when their assumptions do not strictly hold, and we know they do not. In particular, the key assumption of 3PSM is that the probability that a non-significant finding is reported must be constant. But we know that a large preregistered experiment is likely to be published regardless of its statistical significance—think preregistered, multi-lab replications. Even its developers know that these ‘constant’ probabilities depend on a nearly unlimited number of study characteristics:
“the source of funding for the research, the time when the research is conducted and corresponding social preferences, whether a result is the primary or secondary focus of a study, whether research is conducted at a single center or at multiple centers, and even the gender of the principal investigator” (Coburn and Vevea, 2015, p. 310).
Because many of these characteristics will remain unobservable in almost all actual applications, selection models will be mis-specified and their estimates routinely biased. Yet, all simulations of 3PSM have assumed exactly what 3PSM requires to work well: knowledge of exactly where the steps occur, that these steps are large, and that the probabilities of selection are constant and independent of any other factor.
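To make that assumption concrete, the kind of step-function selection rule these simulations build in can be sketched as below. Everything here is hypothetical: the function name, the zero true effect, the common standard error of 0.2, and the 25% chance that a non-significant result is published are invented to illustrate the mechanism and are not taken from any of the cited simulation studies.

```python
import random


def simulate_published_mean(true_effect=0.0, se=0.2, n_studies=20000,
                            p_publish_nonsig=0.25, seed=1):
    """Average published estimate under a single-step selection rule.

    Results with one-sided p < .025 (z > 1.96) are always published;
    all other results are published with a constant probability.
    """
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)
        significant = estimate / se > 1.96
        if significant or rng.random() < p_publish_nonsig:
            published.append(estimate)
    return sum(published) / len(published), len(published)


selected_mean, n_selected = simulate_published_mean()
unselected_mean, n_all = simulate_published_mean(p_publish_nonsig=1.0)
```

With a true effect of zero, the average of the selected results is pushed above zero, while the unselected average stays close to it. A selection model that is told where the step is and that the publication probability is constant is being tested on exactly the data-generating process it assumes.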
Preregistered, multi-lab replications provide the crucial test of meta-analysis methods
The recent preregistered, multi-lab replication (PMR) projects (Open Science Collaboration, Many Labs, etc.) furnish more objective and reliable proxies for the true mean effect sizes. Because PMRs were not conducted to favor one method over another, they offer the crucial test of the effectiveness of alternative meta-analysis methods to correct selective reporting bias after the fact. PMRs have very large sample sizes, ensuring high power, and are widely regarded as having little or no selection bias.
PET-PEESE: Much better than I dared to hope
Kvarven, Strømland, and Johannesson (2019b) conducted a systematic review of all meta-analyses of psychological experiments that a PMR attempted to replicate and identified 15 meta/PMR connected pairs. The beauty of this approach is that PMR provides a more-or-less objective proxy for these 15 mean effects. In their first draft, publicly posted, Kvarven et al. compared only the differences of conventional meta-analysis estimates (crucially, without any explicit correction for publication bias – think random and fixed effects) and the associated replicated effect sizes and concluded that:
“These differences are systematic and on average meta-analytic effect sizes are about three times as large as the replication effect sizes” (Abstract). “Our findings suggest that meta-analyses is ineffective in fully adjusting inflated effect sizes for publication bias and selective reporting. A potentially effective policy for reducing publication bias and selective reporting is pre-registering analysis plans prior to data collection” (Kvarven et al., 2019a, p.13)
Given the analysis in the paper, one could simply observe instead that publication bias tends to exaggerate mean effects threefold. Note that Kvarven et al. (2019a) are only interested in these differences, hence only in the biases of meta-analyses. Initially, no other statistical criterion was needed to support the authors’ conclusion. After their revisions for Nature, their findings substantially changed as did the outcome measures used to evaluate them, but their conclusions did not. In the revision process, they added calculations and comparisons of Trim&Fill, 3PSM and PET-PEESE, which is to say methods that address publication bias.
Kvarven et al. (2019b) found that PET-PEESE has no bias (average difference = -.01; Cohen’s d) (Kvarven et al., 2019b, Table 1) and only one false positive, results very different than their initial findings,
“on average meta-analytic effect sizes are about three times as large as the replication effect sizes.”
In only one case of seven where the replication was not statistically significant (defined as a “failed replication”), PET indicated a statistically significant effect (‘false positive’). Because these are the very issues (bias and false positives) that are the reasons for the replication/credibility crisis, Kvarven et al. (2019b) demonstrate in the clearest possible way that PET-PEESE is a viable and practical remedy for the limitations of conventional meta-analysis as they exist in psychological research, today. But how did the other meta-methods do?
Random effects’ bias = .26 d, and its rate of false positives is 100%. That’s right, random-effects (RE) are always statistically significant, always supportive of conventional theory, regardless of whether the replicated effect size is positive or negative, significant or not. And, this is not some quirk of the 15 experiments investigated. When simulations are calibrated to reflect the key research dimensions found in large surveys of psychology (Stanley, Carter, and Doucouliagos, 2018), random-effects have very large bias under typical conditions (.2587, which, within rounding, is exactly what is found here) and high rates of false positives (95% for k=20, 100% for k=80)—see Stanley (2017, Table 1). Yes, conventional meta-analysis methods (fixed and random effects) are routinely wrong and/or highly exaggerated under normal conditions.
So, what about other pub’bias corrections? Trim&Fill is not notably better than RE: bias = .24 d & false + = 100%, and likewise for 3PSM: bias = .23 d & false + = 86%. PET-PEESE is also the most efficient because it has the lowest MSE (Kvarven et al., 2019b, Table 1). For biased estimators (and we know that three of these methods are biased), mean squared error (MSE) is the proper criterion upon which to evaluate statistical efficiency (and therefore power). The estimator with the smaller MSE is said to be relatively more efficient (Spanos, 1986, p. 234). MSE is calculated by adding the squared bias to the variance, thereby making the proper statistical tradeoff between a reduction in bias and the cost of increased variance. On all conventional statistical criteria (bias, efficiency, and type I errors), large-scale, preregistered, multi-lab replications show that PET-PEESE dominates alternative meta-analysis methods.
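The MSE tradeoff described above can be illustrated with a short sketch. The two lists of differences (meta-analysis estimate minus replication estimate, in Cohen's d) are invented for illustration and are not the Kvarven et al. data; the function name is likewise hypothetical.

```python
def bias_variance_mse(differences):
    """Return (bias, variance, mse) for estimate-minus-benchmark differences."""
    n = len(differences)
    bias = sum(differences) / n
    variance = sum((d - bias) ** 2 for d in differences) / n
    mse = sum(d ** 2 for d in differences) / n   # equals bias**2 + variance
    return bias, variance, mse


# Hypothetical differences, invented purely for illustration.
biased_but_tight = [0.20, 0.25, 0.30, 0.25, 0.30]
unbiased_but_noisy = [-0.30, 0.25, -0.10, 0.20, -0.05]

tight = bias_variance_mse(biased_but_tight)
noisy = bias_variance_mse(unbiased_but_noisy)
```

In this made-up case the noisier estimator still has the smaller MSE, because squaring a large bias costs more than the extra variance does.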
Much, so much, better
Or, as a disinterested researcher put it:
“Tom Stanley and Hristos Docouliagos can be super proud of their baby: it is really rare to see such a beautiful performance for a statistical correction procedure in the social sciences!” (Chabé-Ferret, 2020).
So, with such wonderful findings about the practical feasibility of accommodating publication bias, after the fact, and recovering the underlying true mean effect sizes, on average, what are Kvarven et al.’s (2019b) conclusions?
“We find that meta-analytic effect sizes are significantly different from replication effect sizes for 12 out of the 15 meta-replication pairs. These differences are systematic and, on average, meta-analytic effect sizes are almost three times as large as replication effect sizes. We also implement three methods of correcting meta-analysis for bias, but these methods do not substantively improve the meta-analytic results” (Kvarven et al., 2019b, Abstract).
This summary is demonstrably false by the authors’ own calculation. If driving substantial biases (.26 d) to nothing, eliminating 86% of the false positives, while at the same time reducing MSE by 50% (from .096 to .048, Table 1) is not a substantial improvement, then nothing is.
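The arithmetic behind these three figures can be checked directly from the numbers quoted in this post. The sketch below assumes that .096 is the random-effects MSE, that the random-effects false positives are 7 of the 7 failed replications (its 100% rate), and that each reported MSE decomposes as squared bias plus variance; the variable names are made up for the example.

```python
# Figures quoted in the text (Cohen's d); see the assumptions stated above.
re_bias, re_mse = 0.26, 0.096      # random effects
pp_bias, pp_mse = -0.01, 0.048     # PET-PEESE

# Relative reduction in MSE.
mse_reduction = 1 - pp_mse / re_mse

# False positives among the 7 failed replications.
re_false_pos, pp_false_pos = 7, 1
share_eliminated = (re_false_pos - pp_false_pos) / re_false_pos

# Implied variances, assuming MSE = bias**2 + variance.
re_variance = re_mse - re_bias ** 2
pp_variance = pp_mse - pp_bias ** 2
```

Under these assumptions the implied variance of PET-PEESE is larger than that of random effects, yet its MSE is half as large, because nearly all of the random-effects MSE comes from squared bias.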
Eliminating bias, removing 86% of false+, and reducing MSE by 50% is Substantial!
No matter what other limitations PET-PEESE might have (and it does have a few), this is a success story unmatched in the history of publication bias and meta-analysis methods. To their credit, Kvarven et al. (2019b) do admit that PET-PEESE reduces bias but immediately take this away and suggest that no statistical method is up to the task.
“PET-PEESE does adjust effect sizes downwards, but at the cost of substantial reduction in power and increase in false-negative rate. These results suggest that statistical solutions alone may be insufficient to rectify reproducibility issues.”
The problem is that most readers will only see the abstract or perhaps these concluding sentences and get an entirely incorrect impression. As it stands, any researcher who does not like what PET-PEESE finds about their favored theory can cite this paper, erroneously, as providing evidence that PET-PEESE somehow failed this crucial test of large-scale preregistered replications or somehow failed in its goals to correct bias and reduce false positives. By the way, reducing bias is all that we ever claimed for PET-PEESE, and “Comparing meta-analyses and pre-registered multiple labs replication projects” proves that it succeeds in reducing the bias already contained in the research record far beyond our greatest hope.
 PET-PEESE’s reliability can be compromised by: very high heterogeneity among reported results, consistently low power of published findings (compressing the distribution of SEs) and small research bases (k<10) (Stanley, 2017). But these are limitations already in the research record. For the last four years, my MAER-Net talks have centered on these limitations and how they can be overcome. For example, see our IZA working paper, “Practical significance, meta-analysis and the credibility of economics.”
 We take these issues quite seriously and have 2 additional papers under review that address bias and false positive meta-analyses in psychology. However, a full discussion of these issues is beyond the scope of this brief blog.
 The false positive rates reported here are not exactly what is reported in Kvarven et al. (2019b). They report 3PSM’s false positive rate to be 100% but we find it to be 86% using their posted data, codes and results, and they report PET-PEESE’s rate to be 16% which is impossible when there are 7 statistically non-significant replications.
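 The point about 16% can be verified by listing every false positive rate that is attainable with 7 non-significant replications. This is a small check written for this note, using rounding to the nearest percent as an assumption about how the rates were reported.

```python
n_failed = 7  # statistically non-significant replications

# Every attainable false positive rate, rounded to the nearest percent.
feasible_rates = [round(100 * k / n_failed) for k in range(n_failed + 1)]

rate_16_possible = 16 in feasible_rates
rate_86_possible = 86 in feasible_rates
```

One false positive in seven gives 14% and six in seven gives 86%; no count out of seven rounds to 16%.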
 Again, we refer the reader to a disinterested party, Sylvain Chabé-Ferret. Prof. Chabé-Ferret notes correctly that PET-PEESE’s bias reduction does come at a price—higher standard errors and wider CIs. But this is the expected price that any bias reduction would be expected to pay, and the relevant statistical criterion, MSE, shows that the price is not too high because MSE still falls. The criteria discussed above: bias, MSE and false positives (aka, type I errors) are the exact criteria that dozens of methods papers have used to evaluate the properties of alternative statistical meta-methods. The only other criterion that has been used to evaluate meta-analysis methods is coverage rate (i.e., how often the true effect falls within the computed confidence interval). Here, PET-PEESE also has the best coverage rate among these alternative methods, although Kvarven et al. (2019b) do not directly report them. The other criteria that Kvarven et al. (2019b) use are not appropriate for comparisons of meta-analyses to large-scale replications, but that would be another blog.
Andrews, I. & Kasy, M. (2019). Identification of and correction for publication bias. American Economic Review 109, 2766–94.
Carter, E. C., Schönbrodt, F. D., Gervais, W. M., & Hilgard, J. (2018). Correcting for bias in psychology: A comparison of meta-analytic methods. Advances in Methods and Practices in Psychological Science. Preprint. https://osf.io/jn6x5/ Accessed 6/22/2020.
Chabé-Ferret, S. (2020). “How Large Is Publication Bias and Can We Correct for It?” An Economist's Journey, January 11, 2020, accessed 6/19/2020.
Coburn, K. M. & Vevea, J. L. (2015). Publication bias as a function of study characteristics. Psychological Methods, 20(3), 310-330.
Doucouliagos, H. Paldam, M. and T.D. Stanley (2018). Skating on thin evidence: Implications for public policy. European Journal of Political Economy. 54:16-25 https://doi.org/10.1016/j.ejpoleco.2018.03.004
Hedges, L. V. & Vevea, J. L. (1996). Estimating effect size under publication bias: Small sample properties and robustness of a random effects selection model. Journal of Educational and Behavioral Statistics, 21(4),299–332.
Ioannidis, J.P.A., Stanley, T.D. and Doucouliagos, C. (2017). The power of bias in economics research, The Economic Journal, 127: F236-265.
Kvarven, A., Strømland, E. & Johannesson, M. (2019a). Comparing meta-analyses and pre-registered multiple labs replication projects. https://osf.io/brzwt, accessed 6/19/2020.
Kvarven, A., Strømland, E. & Johannesson, M. (2019b). Comparing meta-analyses and preregistered multiple-laboratory replication projects. Nature Human Behaviour. https://doi.org/10.1038/s41562-019-0787-z.
Many Labs2: Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Jr., Alper, S., … Nosek, B. A. (2018). Many Labs 2: Investigating variation in replicability across sample and setting. https://doi.org/10.31234/osf.io/9654g , accessed June 20, 2020.
Marks-Anglin, A. & Chen, Y. (2020). A historical review of publication bias. https://osf.io/preprints/metaarxiv/zmdpk/, accessed June 20, 2020.
Moreno SG, Sutton AJ, Turner EH, Abrams KR, Cooper NJ, Palmer TM, Ades AE. (2009). Novel methods to deal with publication biases: secondary analysis of antidepressant trials in the FDA trial registry database and related journal publications. BMJ 339:b2981: 494–98. DOI:10.1136/bmj.b2981.
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716–aac4716. doi:10.1126/science.aac4716.
Redmond S, von Elm E, Blümle A, Gengler M, Gsponer T, Egger M. Cohort study of trials submitted to ethics committee identified discrepant reporting of outcomes in publications. Journal of Clinical Epidemiology 2013; 66:1367–1375.
Stanley, T.D., 2005. Beyond publication bias, Journal of Economic Surveys 19: 309-45.
Stanley, T.D. 2008. Meta-regression methods for detecting and estimating empirical effect in the presence of publication bias. Oxford Bulletin of Economics and Statistics 70:103-127.
Stanley, T.D. (2017). Limitations of PET-PEESE and other meta-analysis methods. Social Psychological and Personality Science, 8: 581–591.
Stanley, T. D and Doucouliagos, C. 2014. Meta-regression approximations to reduce publication selection bias, Research Synthesis Methods 5: 60-78.
Stanley, T.D. and Doucouliagos, C. (2015) Neither fixed nor random: Weighted least squares meta-analysis, Statistics in Medicine 34: 2116-27.
Stanley, T.D., Doucouliagos, C. and Ioannidis, J.P.A. (2017). “Finding the Power to Reduce Publication Bias,” Statistics in Medicine, 36: 1580-1598.
Strømland, E. (2019). Preregistration and reproducibility. Journal of Economic Psychology, 75:102143.
Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine 2008; 358:252–260.
Spanos, A. (1986). Statistical Foundations of Econometric Modelling. Cambridge University Press.