Post

Avatar
These folks did a great job of breaking, well, *all* the leading LLMs by asking the models an incredibly simple question. These models should not be relied upon to assist anyone with reasoning. Check out the appendix with examples of truly wild confabulations and errors arxiv.org/html/2406.02...
Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models (arxiv.org)
Avatar
Why does this happen? Well, this almost always happens to "foundational" models when you try to get them to reason about uncommon relational structures between things. More than being a fringe unfair example, these are the tasks for which people are likely to actually need assistance from an LLM.
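If I'm remembering the paper's setup correctly, the core "AIW" question is roughly "Alice has N brothers and M sisters; how many sisters does Alice's brother have?" The relation is trivial once it's made explicit; here's a minimal sketch (the function name and variables are mine, not the paper's):

```python
# AIW-style question: "Alice has N brothers and M sisters.
# How many sisters does Alice's brother have?"
# Alice's brother shares all of Alice's sisters, plus Alice herself.

def sisters_of_alices_brother(n_brothers: int, m_sisters: int) -> int:
    # n_brothers is deliberately unused: it's a distractor in the question.
    return m_sisters + 1

assert sisters_of_alices_brother(3, 2) == 3  # 2 sisters + Alice herself
```

The brother count is pure distraction, which seems to be part of what trips the models up.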
Avatar
TL;DR: LLMs completely fail at just the kinds of reasoning tasks people might actually need to cognitively offload.
Avatar
It's hilarious. Should be publicised as widely as possible.
Avatar
What a great and well-designed study. In my previous life I was a philosopher and wrote a diss on context sensitivity in semantic interpretation, so work like this that highlights LLMs’ elementary failures in simulating reasoning that incorporates contextual information is music to my ears!
Avatar
I am so happy to see you use “confabulate” to describe the process, and I think this term helps point to why: the system is only built to produce text that resembles the training corpus; there has been no effort to add any relationship to logic or reasoning.
Avatar
Just want to be clear that term was applied by the original authors, Nezhurina et al. I also think it's a great way to describe it.
Avatar
there is no reason why an LLM would get this right: it's a statistical model of language -- it does not understand -- it just parrots previous text -- if a question has not been often asked and answered, any LLM will get it wrong
Avatar
Completely agree. But we know that's not how a lot of researchers and adopters see these things.
IMO, this is one of those things where it's completely unsurprising, but also *someone* needs to publish the paper pointing it out.
Personally, I'm also fascinated at how much better they do on standardized reasoning tests, which suggests to me that they have been very thoroughly taught to those tests.
Avatar
Oh yeah, like at least a couple of versions of ChatGPT ago, if you gave it the Wason task it acted like humans do, but if you changed the letters used for the variables it no longer did (a quick sketch of why relabeling shouldn't matter is below). It's like they're putting band-aids on all the cracks hoping we won't go poking around too much.
I remember when the first ChatGPT couldn't handle the Monty Hall problem, and then it couldn't handle sabotaged Monty Hall problems (with things like "but you can see through the doors and know the answer") because it had been retrained to over-index on the standard version. (And also it's bad at puzzles.)
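The Wason point is easy to make concrete: which cards you must flip is fixed by the rule's logical form, so swapping in different letters can't change the correct answer, only the model's behavior. A rough sketch (the labels are my own, not from any benchmark):

```python
# Wason selection task: rule "if a card shows a vowel on one side,
# it shows an even number on the other". You must flip exactly the
# cards that could falsify the rule: visible vowels and visible odd numbers.

def cards_to_flip(visible_faces):
    vowels = set("AEIOU")
    flip = []
    for face in visible_faces:
        if face.isalpha() and face.upper() in vowels:
            flip.append(face)   # a vowel could hide an odd number
        elif face.isdigit() and int(face) % 2 == 1:
            flip.append(face)   # an odd number could hide a vowel
    return flip

# The answer depends only on the rule's form, not on which letters appear:
print(cards_to_flip(["E", "K", "4", "7"]))  # ['E', '7']
print(cards_to_flip(["U", "Z", "8", "3"]))  # ['U', '3']
```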
Avatar
This reminds me of Google's response to the bad AI search results, which amounted to "please stop quality testing our AI, it wasn't meant for that"
Avatar
I generally subscribe to the philosophy that the question of whether a computer can think is akin to that of whether a submarine swims. (paraphrasing Dijkstra) With that said, that abstract is interesting. (but I am not an academic)
Avatar
I model human visual reasoning with computers so I had better think that we capture a lot of aspects of the computations of thought with a computer. But the model emulates the real system's computations on silicon, and they aren't completely commensurate, so there are lots of limitations.
Avatar
Yes, but that's not how LLMs are being sold or championed by the people doing most of the selling and championing.
Avatar
It's also important to understand that big models suffer in several ways from their size (despite what proponents try to claim): they are never unified, coherent models; they end up as networks of conflicting models with unpredictable interconnections.
Avatar
That's why you can ask one to mimic different groups, ideologies, etc.: it then relies more on the specific samples related to those topics. It's also why they vary randomly in performance: the same skill might trigger different, differently trained networks each time.
Avatar
They don't recognize redundancies in their own training so they don't prune themselves (and can't guarantee coherence), they don't generalize well, they don't handle formal logic well (and sometimes not at all), and they can't even reliably remember which definition of a word is in use in a prompt
Avatar
"Confabulations" is a great new descriptor. The LLMs seem to be using the same kind of confident (but daft) logic that characterizes conspiracy theories to justify their answers.
Avatar
This bit right here is my favorite part of the paper and I almost died laughing.
Avatar
Just replicated this with Copilot in Bing, what a wild ride
Avatar
My 3yo was sure my sister's child was her cousin, but that she was *not* my new niece's cousin. Took a few minutes to figure it out: this was her first contact with a reciprocal relative relationship (1st grandchild on both sides). The LLMs weren't advised of this. Are you smarter than a 3yo?
Avatar
I wonder how long it will take for all these models to suddenly excel at this specific problem... But continue to fail at new similarly simple ones. 🤪
Avatar
Interesting article, but a bit sensationalist. LLMs are notoriously bad at math word problems. The core architecture is theoretically incapable of representing arithmetic in a meaningful way. The confabulations and overconfidence are a major issue with the current chatbot use cases, though.
Avatar
There is no reason why a model which gives different answers each time it is asked should be claiming that it is confident in the answer. The system could just try generating a few different answers, see that they aren't reliably the same, and admit it doesn't know.
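That check is cheap to prototype: sample the same question several times at nonzero temperature and only commit to an answer when the samples agree. A sketch of the idea, where `ask_model` is a stand-in for whatever completion call you're using, not a real API:

```python
from collections import Counter

def answer_or_abstain(prompt, ask_model, n_samples=5, min_agreement=0.8):
    """Sample the model several times; abstain unless the answers agree.

    ask_model(prompt) is assumed to return a short answer string from a
    completion call at temperature > 0 -- a placeholder, not a real API.
    """
    answers = [ask_model(prompt).strip().lower() for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    if count / n_samples >= min_agreement:
        return best
    return "I don't know -- my answers to this question were not consistent."
```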
Avatar
I'm trying to train my local model to extract the useful information, send a single line of code to a Python interpreter to do the math, and then state the answer in terms of "according to my calculations." This is very difficult and I have to write many scaffolds and examples! But doable!
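For what it's worth, the scaffold around that pattern can stay fairly small: have the model emit one arithmetic expression, evaluate it in a restricted evaluator rather than a full interpreter, and template the reply. A rough sketch, with a hypothetical `generate()` standing in for the local model:

```python
import ast
import operator

# Whitelist of operators so we only evaluate plain arithmetic, never arbitrary code.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

def _eval(node):
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("only plain arithmetic is allowed")

def answer_with_calculator(question, generate):
    # generate() is a hypothetical call to the local model; it is prompted to
    # return a single arithmetic expression extracted from the question.
    expr = generate(f"Write one arithmetic expression that answers: {question}")
    result = _eval(ast.parse(expr, mode="eval").body)
    return f"According to my calculations, the answer is {result}."
```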
Avatar
Yeah, I think that would be necessary if one wanted to get an LLM to do math well. I really like the Langchain framework for setting up flows like that.
Avatar
Wolfram Alpha also has a plugin for integration with their tools for chatgpt. It's interesting to see all the places they note in the articles that sometimes the LLM makes garbled queries (bad syntax or functions that don't exist), or just forgets to use the plugin, or misrepresents the response
Avatar
The amazing thing about LLMs is the incredible variety of tasks which they can perform badly.
Avatar
I'm sitting in the Cheshire Cat community because I'm using it for my local AI experiments (btw, it's a very nice Langchain project that makes this easy) and there's someone talking about how bad every model except Phi is at function calls. Also Phi and Cohere are both pretty bad at function calls.
Avatar
I've actually been wondering if multimodal models would be better at math problems if they're trained to generate images to "visualize" the problem and then use those images to solve the problem. Generate an image with 2 squares and 4 circles and then count the shapes.
Avatar
they aren’t reasoning in any sense. They are creating strings that meet their standards for being well formed.
Avatar
I did like the one where the AI tried to be sensitive or something and started going on about not judging disability?
Avatar
So, they found that the things that aren't claimed to reason can't reason?
Avatar
But people *are* claiming that they reason! That's exactly the problem. If it was being marketed properly and used for what we'd expect, we wouldn't care. But it's being marketed as "artificial intelligence", which to non-technical people means something that thinks and reasons.
Avatar
People are also claiming the earth is flat. People lie. It has fuck all to do with LLMs.
Avatar
Also, you'll note, the people lying about LLMs aren't the people *producing* LLMs. Go ask anyone involved in producing LLMs if they can reason, and when they get done laughing at you, they might tell you "No" if they think it's worth the effort.
Avatar
Well, maybe those folks should go talk to their marketing divisions, because I've sat in meetings where I'm assured that the LLMs in question can *totally* reason out which of several possible subjects are being discussed and provide the most useful answer in search results. (They can't.)
Avatar
And which LLM did that company produce?
Avatar
Developers at Microsoft, OpenAI, and other AI dev entities regularly argue that LLMs can reason. I’ve laughed at them many times on Twitter. Their arguments always amount to a reductive redefinition of “reasoning.” It’s magical thinking. It’s amazing that AI developers are so vulnerable to it.
Avatar
If you need examples, go to X/Twitter, look up Grady Booch, find one of his many criticisms of LLMs, and note who argues with him.
Avatar
lol Every single executive in charge of AI efforts is lying about LLMs on a regular basis.