Moved my stuff out of my office today, entering a strange interstitial place between one position ending and hopefully the next. The stack of books on my coffee table awaiting a home on a shelf was a great reminder of how awesome the past decade in science has been.
Tentatively, it *should* look something like this. Much more consistent with ML2 Table 2: ~4-5 hypotheses are consistent with FP properties, but *many* of those that failed to replicate are not.
Check higher up in the paper as well. This isn't *our* definition of a false positive; it's the definition in the literature. "Misleading about explanatory power" is an informal definition (which is fine!), but informal definitions don't create testable claims about the abundance of FPs in the literature.
We address this in the discussion: this is how commonly cited models define a false positive, and many inferences about the replication crisis are drawn from those models. But what does it mean to you for a hypothesis to be false?
Why does all this matter? What's the practical difference between a tiny effect and a perfectly zero one? Our model suggests significance is more common than a false-positive model implies, which means smaller file-drawers and fewer QRPs. With N=100, significance should occur ~40% of the time (it's <10% under a FP model).
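If you want to see the contrast yourself, here's a toy simulation (a hypothetical tiny effect of d = 0.2 and my own parameter choices, not the paper's actual model) of how often a tiny-but-nonzero effect reaches significance versus a perfectly zero one:

```python
# Minimal sketch (not the paper's model): simulated rate of p < .05 for a
# one-sample t-test with N = 100, comparing a hypothetical tiny standardized
# effect (d = 0.2) against a perfectly zero one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sims = 100, 20_000

def sig_rate(d):
    # Draw `sims` datasets of size n and count how often the t-test rejects.
    x = rng.normal(loc=d, scale=1.0, size=(sims, n))
    p = stats.ttest_1samp(x, 0.0, axis=1).pvalue
    return (p < 0.05).mean()

print(f"tiny effect (d=0.2): {sig_rate(0.2):.2f}")  # well above chance
print(f"true zero   (d=0.0): {sig_rate(0.0):.2f}")  # ~0.05 by construction
```

Only the zero-effect case is pinned at 5%; any nonzero effect drifts upward with N.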
It also suggests that high rates of replication can be achieved simply by increasing the *replication* sample size. In a sense this is a trivial solution to the replication crisis, but it achieves its goals only by dredging up significance for tiny and perhaps meaningless effects.
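A back-of-the-envelope illustration of the "dredging" point, using a normal-approximation power calculation for a hypothetical tiny effect (d = 0.05; the numbers are mine, not from the paper):

```python
# Two-sided power of a one-sample z-test for a tiny effect (d = 0.05)
# as the replication sample size grows: a big enough N makes almost
# any nonzero effect "replicate".
from scipy.stats import norm

def power(d, n, alpha=0.05):
    z = d * n ** 0.5                    # noncentrality parameter
    crit = norm.ppf(1 - alpha / 2)      # 1.96 for alpha = .05
    return norm.sf(crit - z) + norm.cdf(-crit - z)

for n in (100, 1_000, 10_000):
    print(f"N={n:>6}: power = {power(0.05, n):.3f}")
```

At N=100 the effect is nearly invisible; at N=10,000 it's significant almost every time, whether or not it matters.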
And we might not want to just increase sample size for the sake of replicability. Doing so increases sign error, although it will decrease magnitude error. Large-N research may be significant nearly all the time and highly replicable, but more likely to get the direction wrong...
Applying our model to replication efforts, we find that variation in rates of significance across replication efforts is almost entirely explained by the **replication** sample size. OSC 2015 used N=71: low replicability. Soto, Protzko, etc. used N>1000: high replicability!
If you're wondering why your favorite Many Labs study isn't in that chart above, there are two key things: meta-analytic rates of multi-site replicability are distinct from single-shot estimates, and each of the ML studies had sampling frames intentionally including strong/weak effects. Check the SI.
One study, ML5, is worth discussing because it was intended to address criticisms that small sample sizes explain low rates of replication. It used N=500 and retested failed replications from OSC 2015. Our model shows their outcomes are entirely consistent with replication N determining significance rates.
We develop a minimal alternative model. We assume effect sizes are normally distributed as a result of sampling error, variation in studied effect sizes, and their heterogeneity. We model the chance of obtaining significance, replication, sign error, and magnitude error.
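For the curious, a stripped-down toy version of this kind of model (the parameter values are my own illustrative guesses, not the paper's estimates):

```python
# Toy model: per-hypothesis true effects are normally distributed, an
# original study and a replication each observe them with sampling error,
# and we tally significance, replication, sign error, and magnitude error.
import numpy as np

rng = np.random.default_rng(7)
sims = 50_000
mu, tau = 0.10, 0.10        # assumed mean / spread of true effects (my guesses)
n_orig = 100

true_d = rng.normal(mu, tau, sims)                    # per-study true effect
obs = true_d + rng.normal(0, 1 / np.sqrt(n_orig), sims)
sig = np.abs(obs) * np.sqrt(n_orig) > 1.96            # original significant?

def replication_rate(n_rep):
    # Replication "succeeds" if significant in the same direction as original.
    rep = true_d + rng.normal(0, 1 / np.sqrt(n_rep), sims)
    rep_sig = np.abs(rep) * np.sqrt(n_rep) > 1.96
    same_sign = np.sign(rep) == np.sign(obs)
    return (rep_sig & same_sign)[sig].mean()

print(f"original significance rate: {sig.mean():.2f}")
print(f"replication rate at N=71:   {replication_rate(71):.2f}")
print(f"replication rate at N=1000: {replication_rate(1000):.2f}")

# Sign / magnitude error among significant originals
sign_err = (np.sign(obs) != np.sign(true_d))[sig].mean()
mag_err = (np.abs(obs) > 2 * np.abs(true_d))[sig].mean()
print(f"sign error: {sign_err:.2f}, magnitude exaggeration (>2x): {mag_err:.2f}")
```

Even in this crude version, replication rates jump dramatically when the replication N goes from 71 to 1000, with no false positives anywhere in the model.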
A corollary of this is that if a given hypothesis is a *formal* false positive, it should be significant in 5% of multi-site replications. This is easily checked, and we show rates of significance for multi-site replications are inconsistent with false positives causing failed replications.
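The check is just a binomial calculation. With made-up numbers for illustration: if a formal false positive is significant with probability α = .05 at each site, then seeing, say, 8 significant results across 20 sites is essentially impossible:

```python
# Hypothetical numbers (8 of 20 sites significant), not taken from any
# specific multi-site study: probability of that outcome if each site
# independently rejects at alpha = .05 under a true null.
from scipy.stats import binom

p = binom.sf(7, n=20, p=0.05)   # P(X >= 8), X ~ Binomial(20, .05)
print(f"P(>=8 of 20 sites significant under a true null) = {p:.1e}")
```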
Holy shit their website is a hoot. I love that alongside their description of being a K-mart Cambridge Analytica they feel the need to note they're GDPR compliant.
Many are sparse... no really good way to link this to their work and claims without manually going through a *lot* of preprints, papers, OSF repositories, etc., with tons embargoed, private, or wholly missing.
An interesting bit of discourse on the bad place between Craig Sewell, @stephenjwild.bsky.social and myself about Haidt's findings warrants a little thread here. The topic is this plot ostensibly showing an increase in self-harm hospitalizations for 14-18 y/o's. Obviously big if true...
An obvious problem with Haidt's graph is that it doesn't account for uncertainty. You can add the raw yearly uncertainty back in and you wind up with Craig's graph, which tells a less compelling story.
Once we've added some uncertainty, we're modeling, not just plotting, the data. In this case we're assuming each year is independent, and we wind up with quite wide estimates, undermining the ability to draw conclusions. @stephenjwild.bsky.social put together a spline fit to demonstrate the difference.
My contribution was a simple Gaussian process model that lets us extract a linear trend. It seems to suggest an increase, but it's by no means a slam dunk. It's not a perfect model and doesn't directly get at the smartphone hypothesis, but it demonstrates another way of viewing the data.
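For a flavor of the approach, here's a minimal numpy sketch of the same idea on synthetic data (my own kernel choices and fake numbers, not the actual model or the hospitalization data): a GP with a linear + RBF kernel, where the posterior mean of the linear component alone is the extracted trend.

```python
# GP trend extraction sketch: fit y with K = K_linear + K_rbf + noise,
# then read off the linear component's posterior mean, K_lin @ K^{-1} y.
import numpy as np

def rbf(a, b, amp=1.0, ls=3.0):
    # Smooth short-range wiggles (e.g. year-to-year fluctuation).
    return amp**2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def lin(a, b, amp=0.3):
    # Linear kernel through the centered origin: carries the long-run trend.
    return amp**2 * np.outer(a, b)

rng = np.random.default_rng(0)
years = np.arange(-10.0, 11.0)                 # centered "year" axis
y = 0.3 * years + np.sin(years) + rng.normal(0, 0.5, years.size)  # fake data

K = rbf(years, years) + lin(years, years) + 0.5**2 * np.eye(years.size)
alpha = np.linalg.solve(K, y)
trend = lin(years, years) @ alpha              # posterior mean, linear part only

slope = np.polyfit(years, trend, 1)[0]         # summarize the extracted trend
print(f"extracted linear trend slope = {slope:.2f} per year")
```

The nice part of the additive-kernel setup is that the RBF component soaks up the wiggles, so the linear component's slope (and, in a full treatment, its posterior uncertainty) is the quantity you'd actually argue about.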