Post

Avatar
This little two-step is really something. In the run-up to the launch of GPT-4, he was heralding an astonishing, transformative advance, the harbinger of some cosmic intelligence. Now we all see it’s mostly useless, so… hang onto your hats for the real Great Leap Forward! What a total scam.
'GPT-4 is the dumbest model any of you will ever have to use' declares OpenAI CEO Sam Altman as he bets big on a superintelligence
www.tomsguide.com
Sam Altman talks at Stanford.
Avatar
Every time there’s a shiny demonstration of some new AI regurgitation engine, the zealots always say “and this is the worst AI is ever gonna be!” But it’s not! It really might not get substantially better and could definitely get worse.
Avatar
Sam and Co have basically admitted that they’re out of data and need 5% of world GDP and cold fusion to really make a go of it. Plus the web is filling up with LLM outputs, which they can’t detect, and they’re scraping and training on that shit, which WILL make it worse.
Avatar
The whole barely adequate current thing would not be possible without wholesale IP theft and they’re basically just praying that they’re immunized by the world-historical scale of their crimes and don’t get sued into cinders.
Avatar
But just you wait until the barely incrementally better version, which we spent the GDP of Morocco, ten million hours of hidden underpaid Filipino labor, and half the water in the Colorado River training. Just you wait!
Avatar
I’m ranting like @zitron.bsky.social but good lord the millenarian bullshit coming out of these people just to trick some credulous CTOs into exorbitant Azure contracts really chaps my hide.
Avatar
Their language around this stuff is so teleological that it totally gives the game away
Avatar
There's a sucker born every nanosecond.
Avatar
And it’s all such obvious bullshit and so many idiots still keep falling for it.
Avatar
Did we ever get a good read on what happened the weekend they locked him out? There were some really cult-like stories going around.
Avatar
No need to apologize, true facts deserve to be shouted from the rooftops
Avatar
Avatar
Well Altman is half right anyways; GPT-4 is dumb.
Avatar
I don’t understand how you think AI is “barely adequate.” At what? If its answers can’t be trusted (and they can’t), it has a negative value in every possible intended use case, doesn’t it? Its only positive value comes from the fact that I needed a laugh.
Avatar
If you just need to generate plausibly coherent lorem ipsum text to fill out, say, a web form, it seems like it might be useful for that.
So Utah, having passed a transphobic bathroom bill, has launched an online form for people to snitch on folks they think are in the "wrong" bathroom or locker room. Be a real shame if people on the Internet flooded it with fake reports: ut-sao-special-prod.web.app/sex_basis_co...
Hotline Complaint Form
ut-sao-special-prod.web.app
Avatar
Why would I ever need to— Oh! Yeah, I could see that.
Avatar
Avatar
Already produced more valuable output with 1/100,000 the budget…
Avatar
Powering the World since the 1970s!
Avatar
all the data scale in the world (or, I mean, greater data scale than is available in the world) wouldn't do it. The stuff it can't do now is largely because it's missing the bits to do that; "just throw more unstructured data at it" is pretty well at asymptote
Avatar
Avatar
either I am very dumb (possible) or the "data" includes, like, the collected works of Gateway Pundit? i mean just feeding it more blahblahblah doesn't make it "smarter"
Avatar
the collected works of gateway pundit are actually sort of okay training data for what LLMs actually are, it's just that what they are is inevitably going to be bad at all kinds of tasks because the representational structures for doing those tasks don't really exist in written human language
Avatar
yah but even the very simple things like "Give me a short biography of Barack Obama" aren't helped by loading it up with nonsense that includes "barack obama was born in indonesia and/or kenya"
Avatar
Yes it’s effectively averaging out all the claims on the internet about Obama. The more disinfo there is, the more likely it is to repeat it.
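(To make that concrete, here's a toy sketch. This isn't any real LLM, just a pretend "model" that parrots completions in proportion to how often they appear in its training text, with made-up counts:)

from collections import Counter
import random

# made-up corpus: 90 true claims and 10 disinfo claims about the same prompt
corpus = (
    ["barack obama was born in hawaii"] * 90
    + ["barack obama was born in kenya"] * 7
    + ["barack obama was born in indonesia"] * 3
)

# count how each sentence ends
completions = Counter(s.rsplit(" ", 1)[1] for s in corpus)

def complete():
    # sample an ending for "barack obama was born in ..." by corpus frequency
    words = list(completions)
    return random.choices(words, weights=[completions[w] for w in words])[0]

print(complete())
print({w: c / len(corpus) for w, c in completions.items()})
# roughly {'hawaii': 0.9, 'kenya': 0.07, 'indonesia': 0.03}:
# the more disinfo in the training text, the more often it gets repeated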
Avatar
i wonder how they train this, it will give you whatever answer is near the vibes of your question, which is maybe worse
Avatar
it has no idea what is true or not and adding that facility is plausibly unsolvable, but even gateway pundit, moooost of the things they type on a sentence-to-sentence basis are, like, coherent english sentences; at the insane scale of these corpora it's okay-enough
Avatar
Also - it’s almost always revealed to have a human hand involved. In other words part of the fancy AI isn’t just Langchain and statistics, the engineers had to go in and code the guy’s head because it kept popping off or tell it there is a country in Africa that starts with ‘K’.
Avatar
Yeah it’s mechanical Turks all the way down
Avatar
Hahahaha. Elon Musk’s dancing robot. ‘This isn’t a real robot, but you get the idea.’
Avatar
😂 truly Fucking Amazon stores and so on
Avatar
Plus the original natural language corpus was all ill-gotten at the expense of privacy rights.
Avatar
Avatar
Yup, the more AI-regurgitated crap fills the internet, the worse the new AI-regurgitating models will get. It's a vicious cycle of ever-worsening vomit. The internet was such a useful thing twenty years ago...
Avatar
Avatar
It’s not the case that adding synthetic data necessarily makes models worse. People outside the field love this premise, but it’s behavior that only shows up in experiments that make the most pessimistic assumptions (e.g. earlier data must be discarded at every generation) arxiv.org/abs/2404.01413
Is Model Collapse Inevitable? Breaking the Curse of Recursion by...
arxiv.org
The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent...
Avatar
There’s a whole universe of “worse” short of “collapse”
Avatar
I appreciate your correctives, Ted. However, here the model not collapsing (with accumulation) is hardly good news, right? In fact, loss still happens, just not catastrophically (right side of figure). One can't seem to say from this paper that quality improves with accumulation, but I've only glanced.
Avatar
No, that’s right — but they’re making no effort to improve the model. When synthetic data is produced to improve performance, it’s designed and filtered. It’s not “just train a model on its own output.” The question being tested here is simply, “if that does happen, does it produce collapse?”
Avatar
Ok, but two different scenarios are being suggested here. You're suggesting folks intentionally building & using synthetic data to supplement a training set. The initial reply suggested one where synthetic and real text was being hoovered up in one big training set. Doesn't that make a difference?
Avatar
The paper tests the second scenario, the pessimistic one, and shows that as long as some real data is retained you don’t get collapse. “Retained” could mean “hoovered” or “held over from 2023.”
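(If a concrete picture helps, here's a minimal toy sketch of the two regimes, with made-up sample sizes and generation counts; the "model" is nothing but a Gaussian re-fit to its own samples, not anything taken from the paper:)

import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=100)  # stand-in for the human-written data

def fit(data):
    # the entire "model" here is just a mean and a spread
    return data.mean(), data.std()

def run(generations=1000, accumulate=True):
    pool = real.copy()
    mu, sigma = fit(pool)
    for _ in range(generations):
        synthetic = rng.normal(mu, sigma, size=100)  # the model's own outputs
        # accumulate: keep the real data plus every earlier batch; replace: keep only the newest batch
        pool = np.concatenate([pool, synthetic]) if accumulate else synthetic
        mu, sigma = fit(pool)
    return sigma

# replace-only training lets the fitted spread drift and shrivel over the generations
# (the collapse story); with accumulation it stays in the ballpark of the real data
print("replace only:", run(accumulate=False))
print("accumulate:  ", run(accumulate=True))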
Avatar
The superintelligence has Kuru.
Avatar
filling the web with ai garbage is part of the plan. the machines are going to make us dumber so it's easier to pass the turing test.
Avatar
Avatar
And Google is saying that you can ground your Vertex AI app in Google Search 🤪
Avatar
It used to be hot garbage. Now it will be cold garbage.
Avatar
The only way to train AI is to have human-created content flagged so it can tell the difference, which means we can also use that flag to ignore AI outputs.