The reason this is happening all over the place, btw, is that large pre-2023 corpora of text written by actual humans are the low-background steel (en.wikipedia.org/wiki/Low-bac... ) of LLMs right now -- if LLMs continue to exist in another 5 or 10 years, they will be forever frozen in 2023.
"Tumblr and Wordpress are preparing to sell user data to Midjourney and OpenAI, according to a source with internal knowledge about the deals and internal documentation referring to the deals." www.404media.co/tumblr-and-w...
Tumblr and Wordpress to Sell Users’ Data to Train AI Tools (www.404media.co): Internal documents obtained by 404 Media show that Tumblr staff compiled users' data as part of a deal with Midjourney and OpenAI.
If you train an LLM on text that includes text written by LLMs, it gets weird. If you train an LLM on text that includes text written by LLMs that were themselves trained on LLM output, it gets even weirder. After a few iterations, it goes foom.
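To make that loop concrete, here's a toy simulation (entirely my own sketch; the Gaussian stand-in and all the numbers are illustrative, not anything from a real training pipeline). Each "generation" refits a model to samples drawn from the previous generation's model, the way an LLM trained on LLM output would:

```python
import numpy as np

# Toy model collapse: each generation is fit to synthetic samples
# drawn from the previous generation's fitted distribution.
# (Illustrative parameters; a Gaussian stands in for the "model".)
rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0   # generation 0: the "human-authored" distribution
n = 100                # training examples per generation

for gen in range(1, 51):
    samples = rng.normal(mu, sigma, size=n)    # model output becomes the next training set
    mu, sigma = samples.mean(), samples.std()  # refit on purely synthetic data
    if gen % 10 == 0:
        print(f"gen {gen:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
```

Run it and sigma tends to drift downward: the rare, tail-of-the-distribution outputs disappear first, and each generation is a narrower caricature of the last. That shrinking variance is the statistical core of the "model collapse" results.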
So they basically poison themselves the more they're used in the places they scrape data from? So doesn't that put a hard limit on how much the internet can use them?
Basically, the biggest problem with training an LLM is "where do you get the large corpus of natural human speech in a given language to feed it?" The internet was a boon for linguistics researchers because it generated more examples of natural human speech than ever before!
But the release of easy-to-use LLMs means you can no longer go "okay, this corpus was (mostly) human-authored". Which is why there's such a demand for pre-2023 corpora (and why there's such controversy about the sources LLMs are using for their training data).
Social media sites that existed before 2023 have a huge body of "natural human speech in a given language" that LLM makers can be confident was human-authored and is consequently in high demand, and some of them are trying to monetize it. :/
Humans can, do, and must learn how to write, draw, whatever by studying what other humans have done, and the work they produce from that learning is studied by others in turn, and so on ad infinitum. If an LLM can't be trained on its own output, that's proof its output is useless and subtracts from human knowledge.
I mean, the whole concept of "the AI singularity" was that the AI would be able to continually improve *itself*. But if it can only produce output less useful to its future training than someone's 1999 Street Sharks Erotic Fanfic Site (NTTAWWT), what's the bloody point?