The replication crisis is often described as a crisis of false positives, but is it? In our (re)new(ed) preprint, we show that false positives do not explain the replication crisis, and that varying replication rates are often explained by replication sample size. #metascience osf.io/preprints/so...
False positives get tossed around a lot as a term, but they take the same formal form across models of file-drawer sizes, definitions of QRPs, and models of why replications fail: a Type I error, plain and simple. For example, Ioannidis 2005 journals.plos.org/plosmedicine...
Why Most Published Research Findings Are False (journals.plos.org): Published research findings are sometimes refuted by subsequent evidence, says Ioannidis, with ensuing confusion and disappointment.
A corollary of this is that if a given hypothesis is a *formal* false positive, it should come back significant in about 5% of multi-site replications (at the standard 5% alpha level), regardless of sample size. This is easily checked, and we show that rates of significance in multi-site replications are inconsistent with false positives causing failed replications.
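To make that corollary concrete, here is a quick Monte Carlo sketch (my illustration, not the preprint's analysis): when the null is formally true, replications reach significance at roughly the alpha rate no matter how large the replication sample is. The group sizes and test below are illustrative assumptions.

```python
# Sketch: under a formally true null, significance rates sit near alpha at any N.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def replication_significance_rate(n_per_group, n_replications=5_000, alpha=0.05):
    """Fraction of two-sample t-tests that are significant when the true effect is zero."""
    hits = 0
    for _ in range(n_replications):
        a = rng.normal(0.0, 1.0, n_per_group)  # "control" group, true effect = 0
        b = rng.normal(0.0, 1.0, n_per_group)  # "treatment" group, same distribution
        _, p = stats.ttest_ind(a, b)
        hits += p < alpha
    return hits / n_replications

for n in (20, 71, 500, 1000):
    print(f"n per group = {n:4d}: significance rate ≈ {replication_significance_rate(n):.3f}")
# Each rate hovers around 0.05 regardless of N: the signature a false-positive account
# predicts, and the one the thread says multi-site replications do not show.
```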
This doesn't mean the hypotheses tested are true in any meaningful sense, only that true statistical null models are not the cause of replication failures. What happens if we model the publication process without relying on formal false positives, treating them as rare or non-existent?
We develop a minimal alternative model. We assume observed effect sizes are normally distributed, reflecting sampling error, variation in the effect sizes being studied, and their heterogeneity. From this we model the chance of obtaining significance, replication, sign error, and magnitude error.
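As a rough illustration of that kind of model (a sketch under assumed values of the mean effect, heterogeneity, and sample size, not the preprint's specification or fitted parameters), one can simulate an observed effect as a study-level true effect drawn from a normal distribution plus normal sampling error, and read off the chances of significance, sign error, and magnitude error:

```python
# Sketch of a normal-effects model: mu (mean effect), tau (heterogeneity), and
# n_per_group are illustrative assumptions, not values from the preprint.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def replication_outcomes(mu=0.2, tau=0.1, n_per_group=50, alpha=0.05, sims=200_000):
    se = np.sqrt(2.0 / n_per_group)              # std. error of a standardized mean difference
    theta = rng.normal(mu, tau, sims)            # study-specific true effects (heterogeneity)
    d_hat = rng.normal(theta, se)                # observed effects (sampling error added)
    crit = stats.norm.ppf(1 - alpha / 2) * se    # two-sided significance threshold
    significant = np.abs(d_hat) > crit
    sign_error = significant & (np.sign(d_hat) != np.sign(mu))   # significant, wrong direction
    exaggeration = np.abs(d_hat[significant]).mean() / abs(mu)   # magnitude error among sig. results
    return {
        "P(significant)": significant.mean(),
        "P(significant & wrong sign)": sign_error.mean(),
        "mean |d_hat| / |mu| among significant": exaggeration,
    }

for name, value in replication_outcomes().items():
    print(f"{name}: {value:.3f}")
```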
A cool thing about Bayesian inference is that we can estimate the model's parameters from data. Can it tell us anything new, or does it spit out gibberish? For example, our model predicts that rates of significance for replication efforts will depend on the average observed effect size and the replication sample size.
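As a toy version of that "does it spit out gibberish" check (again my sketch, not the paper's estimation procedure), one can simulate a batch of effect estimates from a normal-normal model with a known mean effect and heterogeneity, then see whether a simple grid posterior recovers them. All parameter values here are illustrative.

```python
# Toy parameter recovery: simulate effect estimates, then recover (mu, tau) with
# a grid posterior under flat priors. Not the preprint's actual model fit.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# simulate a small literature: study-level true effects plus sampling error
mu_true, tau_true = 0.20, 0.10
n = rng.integers(30, 300, size=60)                 # per-group sample sizes (assumed)
se = np.sqrt(2.0 / n)
d_hat = rng.normal(rng.normal(mu_true, tau_true, n.size), se)

# grid posterior for (mu, tau): d_hat_i ~ Normal(mu, tau^2 + se_i^2)
mu_grid = np.linspace(-0.5, 1.0, 301)
tau_grid = np.linspace(0.0, 0.6, 121)
MU, TAU = np.meshgrid(mu_grid, tau_grid, indexing="ij")
log_post = sum(stats.norm.logpdf(d, loc=MU, scale=np.sqrt(TAU**2 + s**2))
               for d, s in zip(d_hat, se))
post = np.exp(log_post - log_post.max())
post /= post.sum()

print(f"true mu  = {mu_true}, posterior mean ≈ {(post * MU).sum():.2f}")
print(f"true tau = {tau_true}, posterior mean ≈ {(post * TAU).sum():.2f}")
```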
Applying our model to replication efforts, we find that variation in rates of significance is almost entirely explained by the **replication** sample size. OSC 2015 used N = 71: low replicability. Soto, Protzko, and others used N > 1000: high replicability!
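A back-of-the-envelope power calculation shows why replication N can dominate. The small standardized effect (d = 0.15), the two-sided z-test, and the per-group sample sizes are all my illustrative choices, loosely spanning the range mentioned above rather than the studies' actual designs.

```python
# Power of a replication as a function of its sample size, for one fixed small effect.
import numpy as np
from scipy import stats

def power_two_sided(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for a standardized mean difference d."""
    se = np.sqrt(2.0 / n_per_group)          # std. error of the difference in means (SD = 1)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    shift = d / se
    return stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)

d = 0.15                                     # a small true effect (assumed)
for n in (71, 250, 1000, 2000):              # replication sizes in the ballpark discussed above
    print(f"n per group = {n:4d}: power ≈ {power_two_sided(d, n):.2f}")
# Small replications rarely reach significance for an effect this size;
# very large replications almost always do.
```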
If you're wondering why your favorite Many Labs study isn't in that chart above, there are two key things: meta-analytic rates of multi-site replicability are distinct from single-shot estimates, and each of the ML studies had a sampling frame that intentionally included strong and weak effects. Check the SI.
One study, ML5, is worth discussing because it was intended to address criticisms that small sample sizes explain low rates of replication. It used N = 500 and retested failed replications from OSC 2015. Our model shows its outcomes are entirely consistent with replication N determining significance rates.
Why does all this matter? What's the practical difference between a tiny effect and a perfectly zero one? Our model suggests significance is more common, which means smaller file-drawers and fewer QRPs. With N = 100, significance should occur ~40% of the time (it's < 10% under a false-positive model).
It also suggests that high rates of replication can be achieved simply by increasing *replication* sample size. In a sense this is a trivial solution to the replication crisis, but it achieves its goals only by dredging up significance for tiny and perhaps meaningless effects.
And we might not want to increase sample size just for the sake of replicability. Doing so increases sign error, although it decreases magnitude error. Large-N research may be significant nearly all the time and highly replicable, but also more likely to get the direction wrong...
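To see how that trade-off can arise, here is a small simulation under assumed heterogeneity around a small mean effect (both values are illustrative choices, not the preprint's): as the replication N grows, significant results in the wrong direction become more common, while the typical inflation of significant estimates shrinks.

```python
# Sign error vs. magnitude error as replication sample size grows, under heterogeneity.
# mu (mean effect) and tau (heterogeneity) are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, tau, alpha, sims = 0.05, 0.10, 0.05, 400_000

for n_per_group in (50, 200, 1000, 5000):
    se = np.sqrt(2.0 / n_per_group)
    theta = rng.normal(mu, tau, sims)                 # study-level true effects
    d_hat = rng.normal(theta, se)                     # observed effects
    crit = stats.norm.ppf(1 - alpha / 2) * se
    sig = np.abs(d_hat) > crit
    wrong_sig = (sig & (np.sign(d_hat) != np.sign(mu))).mean()   # significant AND wrong direction (vs. mu)
    exaggeration = np.abs(d_hat[sig]).mean() / abs(mu)           # typical inflation among significant results
    print(f"n = {n_per_group:5d}: P(sig) = {sig.mean():.2f}, "
          f"P(sig & wrong sign) = {wrong_sig:.2f}, exaggeration ≈ {exaggeration:.1f}x")
# Wrong-direction significant results become more common as n grows, while the
# exaggeration of significant estimates shrinks.
```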
Putting this all together, our results suggest that the replication crisis does not arise from false positives produced by large file-drawers and abundant QRPs. Those exist, but the leading cause seems to be small effects, exaggerated by publication bias (magnitude error), combined with insufficient replication sample sizes.
Our model also suggests that reforms may only create the illusion of fixing the replication crisis: they don't change which hypotheses enter the literature or whether those hypotheses come out significant (at large N), while increasing sign error and decreasing magnitude error. Reforms may be worthwhile, but for other reasons!
Huge thanks to my collaborators @carlbergstrom.com, @jevinwest.bsky.social, Richard Mann and Kevin Gross.
Thanks a lot for this nice thread. Would this imply that there would be benefits to prioritizing replications of stronger effects (effects that require fewer participants to reach good power in a replication)? This was a potential implication of the replication database.
I think it would depend entirely on the inferential goals and broader context. Small absolute effects can be really important, especially when they scale (e.g., social media impacts, policy impacts). That said, replicating merely to observe significance again seems less useful.
This is really interesting! Thanks for sharing. Just a quick note: I was trying to find a link to the repo in the manuscript, but couldn't get there because the URL is cut off at the edge of the page.