Post

Avatar
These folks did a great job of breaking, well, *all* the leading LLMs by asking the models an incredibly simple question. These models should not be relied upon to assist anyone with reasoning. Check out the appendix with examples of truly wild confabulations and errors arxiv.org/html/2406.02...
Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models (arxiv.org)
Avatar
Why does this happen? Well, this almost always happens to "foundational" models when you try to get them to reason about uncommon relational structures between things. More than being a fringe unfair example, these are the tasks for which people are likely to actually need assistance from an LLM.
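If I'm remembering the paper's setup correctly, the core "AIW" question is roughly "Alice has N brothers and M sisters; how many sisters does Alice's brother have?" The relation is trivial once it's made explicit; here's a minimal sketch (the function name and variables are mine, not the paper's):

```python
# AIW-style question: "Alice has N brothers and M sisters.
# How many sisters does Alice's brother have?"
# Alice's brother shares all of Alice's sisters, plus Alice herself.

def sisters_of_alices_brother(n_brothers: int, m_sisters: int) -> int:
    # n_brothers is deliberately unused: it's a distractor in the question.
    return m_sisters + 1

assert sisters_of_alices_brother(3, 2) == 3  # 2 sisters + Alice herself
```

The brother count is pure distraction, which seems to be part of what trips the models up.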
Avatar
TL;DR: LLMs completely fail at just the kinds of reasoning tasks people might actually need to cognitively offload.
Avatar
It's hilarious. Should be publicised as widely as possible.
Avatar
What a great and well-designed study. In my previous life I was a philosopher and wrote a diss on context sensitivity in semantic interpretation, so work like this that highlights LLMs’ elementary failures in simulating reasoning that incorporates contextual information is music to my ears!
Avatar
I am so happy to see you use “confabulate” to describe the process, and I think this term helps point to why: the system is only built to produce text that resembles the training corpus; there has been no effort to add any relationship to logic or reasoning.
Avatar
Just want to be clear that term was applied by the original authors, Nezhurina et al. I also think it's a great way to describe it.
Avatar
there is no reason why an LLM would get this right: it's a statistical model of language -- it does not understand -- it just parrots previous text -- if a question has not been often asked and answered, any LLM will get it wrong
Avatar
Completely agree. But we know that's not how a lot of researchers and adopters see these things.
IMO, this is one of those things where it's completely unsurprising, but also *someone* needs to publish the paper pointing it out.
Personally, I'm also fascinated at how much better they do on standardized reasoning tests, which suggests to me that they have been very thoroughly taught to those tests.
Avatar
Oh yeah, like at least a couple of versions of ChatGPT ago, if you gave it the Wason task it acted like humans do, but if you changed the letters used for the variables it no longer did (a quick sketch of why relabeling shouldn't matter is below). It's like they're putting band-aids on all the cracks hoping we won't go poking around too much.
I remember when the first ChatGPT couldn't handle the Monty Hall problem, and then it couldn't handle sabotaged Monty Hall problems (with things like "but you can see through the doors and know the answer") because it had been retrained to over-index on the standard version. (And also it's bad at puzzles.)
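The Wason point is easy to make concrete: which cards you must flip is fixed by the rule's logical form, so swapping in different letters can't change the correct answer, only the model's behavior. A rough sketch (the labels are my own, not from any benchmark):

```python
# Wason selection task: rule "if a card shows a vowel on one side,
# it shows an even number on the other". You must flip exactly the
# cards that could falsify the rule: visible vowels and visible odd numbers.

def cards_to_flip(visible_faces):
    vowels = set("AEIOU")
    flip = []
    for face in visible_faces:
        if face.isalpha() and face.upper() in vowels:
            flip.append(face)   # a vowel could hide an odd number
        elif face.isdigit() and int(face) % 2 == 1:
            flip.append(face)   # an odd number could hide a vowel
    return flip

# The answer depends only on the rule's form, not on which letters appear:
print(cards_to_flip(["E", "K", "4", "7"]))  # ['E', '7']
print(cards_to_flip(["U", "Z", "8", "3"]))  # ['U', '3']
```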
Avatar
This reminds me of Google's response to the bad AI search results, which amounted to "please stop quality testing our AI, it wasn't meant for that"
Avatar
I generally subscribe to the philosophy that the question of whether a computer can think is akin to that of whether a submarine swims. (paraphrasing Dijkstra) With that said, that abstract is interesting. (but I am not an academic)
Avatar
I model human visual reasoning with computers so I had better think that we capture a lot of aspects of the computations of thought with a computer. But the model emulates the real system's computations on silicon, and they aren't completely commensurate, so there are lots of limitations.
Avatar
Yes, but that's not how LLMs are being sold or championed by the people doing most of the selling and championing.
Avatar
It's also important to understand that big models suffer in several ways from their size (despite what proponents try to claim): they are never unified, coherent models; they end up as networks of conflicting models with unpredictable interconnections.
Avatar
That's why you can ask one to mimic different groups, ideologies, etc.: it then relies more on the specific samples related to those topics. It's also why they vary randomly in performance: the same skill might trigger different, differently trained networks each time.
Avatar
They don't recognize redundancies in their own training so they don't prune themselves (and can't guarantee coherence), they don't generalize well, they don't handle formal logic well (and sometimes not at all), and they can't even reliably remember which definition of a word is in use in a prompt
Avatar
"Confabulations" is a great new descriptor. The LLMs seem to be using the same kind of confident (but daft) logic that characterizes conspiracy theories to justify their answers.
Avatar
This bit right here is my favorite part of the paper and I almost died laughing.
Avatar
Just replicated this with Copilot in Bing, what a wild ride
Avatar
My 3yo was sure my sister's child was her cousin, but that she was *not* my new niece's cousin. Took a few minutes to figure it out: this was her first contact with a reciprocal relative relationship (1st grandchild on both sides). The LLMs weren't advised of this. Are you smarter than a 3yo?
Avatar
I wonder how long it will take for all these models to suddenly excel at this specific problem... But continue to fail at new similarly simple ones. 🤪
Avatar
Interesting article, but a bit sensationalist. LLMs are notoriously bad at math word problems. The core architecture is theoretically incapable of representing arithmetic in a meaningful way. The confabulations and overconfidence are a major issue with the current chatbot use cases, though.
Avatar
There is no reason why a model which gives different answers each time it is asked should be claiming that it is confident in the answer. The system could just try generating a few different answers, see that they aren't reliably the same, and admit it doesn't know.
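That check is cheap to prototype: sample the same question several times at nonzero temperature and only commit to an answer when the samples agree. A sketch of the idea, where `ask_model` is a stand-in for whatever completion call you're using, not a real API:

```python
from collections import Counter

def answer_or_abstain(prompt, ask_model, n_samples=5, min_agreement=0.8):
    """Sample the model several times; abstain unless the answers agree.

    ask_model(prompt) is assumed to return a short answer string from a
    completion call at temperature > 0 -- a placeholder, not a real API.
    """
    answers = [ask_model(prompt).strip().lower() for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    if count / n_samples >= min_agreement:
        return best
    return "I don't know -- my answers to this question were not consistent."
```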
Avatar
I'm trying to train my local model to extract the useful information, send a single line of code to a Python interpreter to do the math, and then state the answer in terms of "according to my calculations." This is very difficult and I have to write many scaffolds and examples! But doable!
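For what it's worth, the scaffold around that pattern can stay fairly small: have the model emit one arithmetic expression, evaluate it in a restricted evaluator rather than a full interpreter, and template the reply. A rough sketch, with a hypothetical `generate()` standing in for the local model:

```python
import ast
import operator

# Whitelist of operators so we only evaluate plain arithmetic, never arbitrary code.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

def _eval(node):
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("only plain arithmetic is allowed")

def answer_with_calculator(question, generate):
    # generate() is a hypothetical call to the local model; it is prompted to
    # return a single arithmetic expression extracted from the question.
    expr = generate(f"Write one arithmetic expression that answers: {question}")
    result = _eval(ast.parse(expr, mode="eval").body)
    return f"According to my calculations, the answer is {result}."
```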
Avatar
Yeah, I think that would be necessary if one wanted to get an LLM to do math well. I really like the Langchain framework for setting up flows like that.
Avatar
Wolfram Alpha also has a plugin for integration with their tools for chatgpt. It's interesting to see all the places they note in the articles that sometimes the LLM makes garbled queries (bad syntax or functions that don't exist), or just forgets to use the plugin, or misrepresents the response
Avatar
The amazing thing about LLMs is the incredible variety of tasks which they can perform badly.
Avatar
I'm sitting in the Cheshire Cat community because I'm using it for my local AI experiments (btw, it's a very nice Langchain project that makes this easy) and there's someone talking about how bad every model except Phi is at function calls. Also Phi and Cohere are both pretty bad at function calls.
Avatar
I've actually been wondering if multimodal models would be better at math problems if they're trained to generate images to "visualize" the problem and then use those images to solve the problem. Generate an image with 2 squares and 4 circles and then count the shapes.
Avatar
they aren’t reasoning in any sense. They are creating strings that meet their standards for being well formed.
Avatar
I did like the one where the AI tried to be sensitive or something and started going on about not judging disability?
Avatar
So, they found that the things that aren't claimed to reason can't reason?
Avatar
But people *are* claiming that they reason! That's exactly the problem. If it was being marketed properly and used for what we'd expect, we wouldn't care. But it's being marketed as "artificial intelligence", which to non-technical people means something that thinks and reasons.
Avatar
People are also claiming the earth is flat. People lie. It has fuck all to do with LLMs.
Avatar
Also, you'll note, the people lying about LLMs aren't the people *producing* LLMs. Go ask anyone involved in producing LLMs if they can reason, and when they get done laughing at you, they might tell you "No" if they think it's worth the effort.
Avatar
Well, maybe those folks should go talk to their marketing divisions, because I've sat in meetings where I'm assured that the LLMs in question can *totally* reason out which of several possible subjects are being discussed and provide the most useful answer in search results. (They can't.)
Avatar
And which LLM did that company produce?
Avatar
Developers at Microsoft, OpenAI, and other AI dev entities regularly argue that LLMs can reason. I’ve laughed at them many times on Twitter. Their arguments always amount to a reductive redefinition of “reasoning.” It’s magical thinking. It’s amazing that AI developers are so vulnerable to it.
Avatar
If you need examples, go to X/Twitter, look up Grady Booch, find one of his many criticisms of LLMs, and note who argues with him.
Avatar
lol Every single executive in charge of AI efforts is lying about LLMs on a regular basis.