
Avatar
The reason this is happening all over the place, btw, is that large pre-2023 corpora of text written by actual humans are the low-background steel (en.wikipedia.org/wiki/Low-bac... ) of LLMs right now -- if LLMs continue to exist in another 5, 10 years, they will be forever frozen in 2023.
"Tumblr and Wordpress are preparing to sell user data to Midjourney and OpenAI, according to a source with internal knowledge about the deals and internal documentation referring to the deals." www.404media.co/tumblr-and-w...
Tumblr and Wordpress to Sell Users’ Data to Train AI Tools (www.404media.co): Internal documents obtained by 404 Media show that Tumblr staff compiled users' data as part of a deal with Midjourney and OpenAI.
Avatar
If you train an LLM on text that includes texts written by LLMs, it gets weird. If you train an LLM on texts that include texts written by LLMs that were themselves trained on LLM output, it gets even weirder. After a few iterations, it goes foom.
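(To make the "it gets weirder each round" intuition concrete, here's a toy sketch -- a pure caricature, nothing to do with real transformer training. Pretend "human text" is a Gaussian distribution, and let each model generation fit itself only to the mode-favoring output of the previous generation, the way likelihood-favoring decoding under-samples rare content. The spread of the distribution -- the weird tails -- vanishes within a few generations:)

```python
import random
import statistics

def collapse_demo(generations=8, n=4000, seed=42):
    """Toy illustration of model collapse (NOT a real LLM).

    Generation 0 is "human" data: a Gaussian with stdev 1. Each later
    generation fits a Gaussian to the previous generation's output,
    but only ever sees samples near the mode (within 1.5 stdev),
    mimicking likelihood-favoring decoding. Tail content is lost on
    every round, so the fitted spread shrinks toward zero.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: human-authored data
    history = []
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        # the model only "publishes" its high-probability output
        kept = [x for x in samples if abs(x - mu) <= 1.5 * sigma]
        mu = statistics.fmean(kept)
        sigma = statistics.stdev(kept)
        history.append(sigma)
    return history

print(collapse_demo())   # the stdev shrinks every generation
```

(In this caricature each round multiplies the spread by roughly 0.74, so after eight rounds less than a tenth of the original variety is left.)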
Avatar
(For the record, Dreamwidth has never been contacted by anyone looking to license our text corpus, but if we were, our answer would be "no, and also fuck you.")
i feel like "releasing and aggressively marketing our product that requires large amounts of human-normal text to perform optimally will lead to diminished human-normal text" is one of those things they should have probably thought of beforehand?
Avatar
Google already hosed themselves the same way on search results by trying to keep you on a Google page; shitting in their own well is an emergent behavior for them.
Avatar
You need to start by believing deep in your soul that the human-normal text is superior as a class to reach that conclusion, which rules out all the STEM-poisoned bros making LLMs in the first place
Avatar
Like, if someone individually wants to license their writing to train an LLM, have at it. I don't think the entire *field* is *inherently* unethical. But boy are most of the current implementations unethical as fuck.
Avatar
But I sure can't make that choice for people! (Especially since our rights grant clause doesn't allow it: we deliberately have a *really* narrowly scoped limiter on the clause.)
Avatar
Any clues that LJ might sell their content? Even though many people moved off, I bet there are still plenty of old posts there. And even if someone delete(d)(s) their account, is it likely they still have the words?
Avatar
And those of us who are on DW thank you very much for that.
Avatar
So they basically poison themselves the more they're used in places they scrape data from? So it puts a hard limit on how much the internet will use it?
Avatar
Basically, the biggest problem with training an LLM is "where do you get the large corpus of natural human speech in a given language to feed it". The internet was a boon for linguistics researchers because it generated more examples of natural human speech than ever before!
Avatar
But the release of easy-to-use LLMs means you can no longer go "okay, this corpus was (mostly) human-authored". Which is why there's such a demand for pre-2023 corpora (and why there's such controversy about the sources LLMs are using for their training data).
Avatar
Social media sites that existed before 2023 have a huge body of "natural human speech in a given language" that LLM makers can be confident was human-authored and is consequently in high demand, and some of them are trying to monetize it. :/
Avatar
Humans can, do, and must learn how to write, draw, whatever by studying what other humans have done, and the work they produce from learning is used by others, and so ad infinitum. If an LLM can't be trained on its own output, it is proof its output is useless and subtracts from human knowledge.
Avatar
I mean, the whole concept of "The AI singularity" was the AI would be able to continually improve *itself*. But if it can only produce output less useful to its future training than someone's 1999 Street Sharks Erotic Fanfic Site (NTTAWWT), what's the bloody point?
Avatar
Like grinding up dead cows to bulk up the cattle feed, and then finding out about prions.
Avatar
And Google is sitting on the dejanews and Google group archives
Avatar
I cannot imagine that they haven't already been licensed.
Avatar
No, it creates a sinking lid on how useful the internet will be.
Avatar
More like, it eliminates "the internet" as a valid source of training data. (I mean, LLMs in general are having an extreme negative effect on how useful the internet is, but that's a separate problem.)
Avatar
Maybe "high-value", accessible only by subscription, LLMs will stop using the internet as a data source. But there will be plenty of low-value, cheap (or free with ad placement) LLMs that will feed on (and continue to pollute) an LLM-infested internet.
Avatar
This is an interesting quirk, since things taught through multiple humans get refined. With AI it just turns to junk? Lol
Avatar
I think I watched this movie. At some point we are going to have to clip their toenails at night. #MultiplicityReference
Avatar
The ultimate expression of Generation Loss. Some kind of digital entropy that simply results in an even plane of diffuse chaos.
Avatar
It's Spämmerdämmerung, the Twilight of the Posters.
Avatar
But how do we get to step foom fastest and with as few detours into trying to find new uncontaminated texts as possible? Yes, I want to watch the LLM world burn
Avatar
Knowing what I do of technology and the way everyone reinvents the wheel, I suspect that the 2023 LLMs will become useless, fall out of favor, humans will write more stuff, and then in like 2039, some next gen techbro will start it up again, so we’ll have a series of LLMs stuck in various years.
Avatar
You make a lot of sense, but LLMs will disappear the instant that they are found lacking in value. They cost too damn much to run, which will be their inevitable downfall, in my opinion.
Avatar
When has that ever stopped a tech bro from trying? Future Bro will just say “it’ll work when WE do it.”
Avatar
There *are* ways to compensate for the recursion effect and prevent model collapse, but they're harder to do and take a lot more effort (making the process of training even more difficult and expensive).
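(One of the simpler compensations is anchoring every generation's training mix with a fixed share of verified human-authored data, so the model never trains purely on its predecessors' output. A toy numerical sketch -- a Gaussian caricature, not a real training pipeline, and the 50% mixing fraction is just an arbitrary illustration:)

```python
import random
import statistics

def train_generations(human_frac, generations=8, n=4000, seed=1):
    """Toy sketch of anchoring with human data (NOT a real pipeline).

    "Human" data is a Gaussian with mean 0, stdev 1. Each generation's
    corpus mixes a fixed fraction of fresh human samples with
    mode-favoring output from the current model (samples within
    1.5 stdev of its mean). Returns the final generation's stdev.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    for _ in range(generations):
        n_human = int(n * human_frac)
        corpus = [rng.gauss(0.0, 1.0) for _ in range(n_human)]
        # fill the rest of the corpus with the model's own output
        while len(corpus) < n:
            x = rng.gauss(mu, sigma)
            if abs(x - mu) <= 1.5 * sigma:   # likelihood-favoring decoding
                corpus.append(x)
        mu = statistics.fmean(corpus)
        sigma = statistics.stdev(corpus)
    return sigma

collapsed = train_generations(0.0)   # pure model output: spread collapses
anchored = train_generations(0.5)    # half human data: spread mostly survives
```

(In the caricature, training purely on model output shrinks the spread toward zero, while the 50/50 mix settles at a stable nonzero spread -- but note the anchoring only works if you can still *identify* human-authored data, which is the whole problem.)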
Avatar
ButItMightWorkForUs.jpg
Avatar
That and I imagine the ember of LLM usefulness will be kept alive in software engineering where you can have a useful model that runs on your laptop. Also the perfect place for a tech bro to make the 'software solves my problems, so clearly it can solve ALL problems!' mistake yet again
Avatar
When the angel investment money dries up, as it becomes clear the ROI will never materialize.
Avatar
... or find a way to push the costs off on "independent contractors" of some kind, while getting paid rent on it somehow.
Avatar
Sorry, I wasn't clear what I meant. I'm not saying future tech bro won't do it again; you are right there. I'm just saying the previous generations will be gone. I had thought you were saying there would be a series of LLM generations hanging around. I disagree with that.
Avatar
So you mean they'll disappear the way bitcoin has? any day now...
Avatar
So they really *are* gonna be the next crypto...fantastic.
Avatar
For some reason the idea of someone plagiarizing the heartbroken journal posts I wrote as a new college grad makes me even more disgusted than someone plagiarizing my actual creative work.
Avatar
Low-background steel is a fantastic metaphor here.
Avatar
Nobody should ever want any product that's going to train on *tumblr* of all parts of the internet. We sound like nonsense most of the time even to ourselves. Most of my TL has been debating being more surprised by a faerie or a walrus showing up at your door for like a week.
Avatar
That’s what I was thinking. tumblr has its own, ever-evolving dialect.
Avatar
And in-jokes. Like "You have nice shoelaces" Or "Remember DashCon"
Avatar
This also sort-of implies that LLMs are going to hit peak utility and then decline, as their information grows increasingly outdated and any new information is irreparably tainted by other LLMs.
Avatar
So if humanity had one shot to train machine models on all of the text we'd ever written, then this may prove to be a less-than-optimal moment for it; 5 years from now we'll have much better algorithms and nothing to train them on.