Not totally surprising though. This might sound weird, but LLMs don't really know what a "letter" is. They process text as tokens: essentially whole words, or at least phonics-sized word chunks.
They pick up some spelling through training, but it's still sort of an alien concept to the architecture.
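A toy sketch of why this happens (not a real LLM tokenizer, and the vocabulary here is made up): subword tokenizers segment a word into chunks greedily, and the model only ever sees the chunk IDs, never the letters inside them.

```python
# Hypothetical mini-vocabulary; real tokenizers learn tens of thousands of merges.
vocab = ["straw", "berry", "st", "raw", "ber", "ry"]

def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation into subword chunks (BPE-style sketch)."""
    chunks = []
    i = 0
    while i < len(word):
        for piece in sorted(vocab, key=len, reverse=True):
            if word.startswith(piece, i):
                chunks.append(piece)
                i += len(piece)
                break
        else:
            # No known chunk matches: fall back to a single character.
            chunks.append(word[i])
            i += 1
    return chunks

print(greedy_tokenize("strawberry", vocab))  # → ['straw', 'berry']
```

So a question like "how many r's are in strawberry?" asks about letters inside chunks the model never directly observes, which is why spelling tasks are so unnatural for the architecture.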
This is what people don't understand. Adding references doesn't matter because the program is stringing together word tokens it associates with each other, that's it. There is no cognition. It's just suggested text on steroids.
Exactly.
It's also why "art" programs can't seem to manage text. They average pixels in regions they know contain "writing", producing shapes that are almost, but not quite, letters.
Text generation has actually improved significantly in the most recent generation of models. The trick was to build a new captioner model that makes sure any text appearing in a training image is included in that image's training caption.
DALLE-3 (Bing's image generator) now handles letters decently, even if its words can be nonsense.
As the DALLE-3 paper* notes, the fact that the model is still thinking in word tokens instead of letters is probably holding it back in that arena too.
* cdn.openai.com/papers/dall-...