The public unveiling of OpenAI’s ChatGPT on November 30, 2022, was met with justified astonishment along with unjustified hopes and fears that computers might be on the verge of artificial general intelligence (AGI), the ability to perform cognitive tasks as well as or better than humans.
It is now apparent that AGI is not imminent and is not going to be achieved by scaling up ChatGPT or other large language models (LLMs). The core problem is that LLMs are stochastic text predictors—nothing more—and using more data, more parameters, and more compute is not going to give LLMs an understanding of how words relate to the real world.
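To make the phrase "stochastic text predictor" concrete, here is a minimal, purely illustrative Python sketch. The vocabulary, scores, and prompt are invented for the example and have nothing to do with any real model's internals; the point is only that a language model scores candidate next tokens and then samples one, so the same prompt can yield different outputs on different runs.

```python
import math
import random

# A toy illustration (not any vendor's actual code) of stochastic next-token
# prediction: the model assigns scores (logits) to candidate next tokens and
# then samples one, so identical prompts can produce different continuations.

def sample_next_token(logits, temperature=1.0, rng=random):
    """Softmax the temperature-scaled logits and sample one token."""
    tokens = list(logits.keys())
    scaled = [logits[t] / temperature for t in tokens]
    max_s = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - max_s) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for the next word after "A possum's snout is ..."
logits = {"long": 2.1, "pointed": 1.8, "pink": 0.9, "a tail": 0.2}

# The same prompt, sampled five times, need not give the same answer.
print([sample_next_token(logits, temperature=0.8) for _ in range(5)])
```

Raising the temperature flattens the distribution and makes repeated runs disagree more often; lowering it toward zero makes the choice nearly deterministic.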
A more promising path is to make LLMs useful through extensive post-training by domain experts. For example, TurboTax, which has been around since long before LLMs appeared, now uses an LLM to generate the text for interacting with users but relies on experts for its calculations and for determining what questions to ask users. I wouldn’t call TurboTax intelligent any more than I would call a pocket calculator intelligent, but I would call it useful.
I recently gave several examples contradicting the claim by OpenAI’s CEO, Sam Altman, that GPT-5 “really feels like talking to an expert in any topic, like a PhD-level expert.” One example that resonated with many correspondents was the prompt, “Please draw me a picture of a possum with 5 body parts labeled.” GPT-5 generated a reasonable rendition of a possum but four of the five labeled body parts were incorrect:
How did Gemini do?
Google’s Gemini 3 LLM seems to be as good as or better than GPT-5. So, I recently tried the possum prompt and Gemini’s answer was, in fact, much better. The drawing was primitive but the five labels were correct:
I gave the same query a day later and Gemini generated a much more realistic picture, again with correct labels (though there were six labels instead of the requested five):
I am quite certain that no LLM (including Gemini) actually “knows” the similarities and differences among various possum parts. They might succeed in matching the label “eye” with a picture containing just an eye, while failing to distinguish between an eye and an ear in a picture containing both. So I cannot help thinking that some Google post-trainers had seen my possum example and trained Gemini to answer correctly.
But what happens when the query is slightly changed?
I have often found that LLMs cannot generalize the way humans can. They consequently stumble on slight variations of classic puzzles, answering the original question rather than the modified one, as they did with, for example, the Monty Hall problem and the river-crossing problem. So, I changed my possum query slightly, asking Gemini for 6 body-part labels: “Please draw me a picture of a possum with 6 body parts labeled.”
The answer was a disaster:
I asked, “What is a sensitive vibarisaie?” And Gemini responded, “It sounds like you’re referring to vibrissae (pronounced vye-BRISS-ee), which is the scientific name for an animal’s whiskers.”
I then asked, “What is a fileiinspie?” and Gemini responded, “It looks like you might be running into some tricky scientific terms! Based on the context of possum anatomy, ‘fileiinspie’ is likely a misspelling of filiform papillae.”
I continued, “What are Filiform Papillae?” And it answered, “If you look at a possum’s tongue, it isn’t smooth—it’s covered in tiny, specialized bumps called filiform papillae.”
Trying another angle
I then tried another modification of my possum query, which might have escaped the attention of Gemini’s post-trainers: “Draw me a picture of a kangaroo with 5 body parts labeled.” Another disaster:
One more variation: “Draw me a picture of a rabbit with 5 body parts labeled.” One more flop:
When I later repeated the experiment, “Please draw me a picture of a possum with 6 body parts labeled,” Gemini generated a very realistic picture, with 7 mislabeled body parts:
The original correct answer did not even hold up
At this point, I was convinced that Gemini had been trained to answer the 5-body-part possum query but flubbed variations.
Before writing up these results, I tried the original 5-body-part possum question again and was flabbergasted to see that it did not generate a correct answer:
I tried again and got this:
Changing my mind
The great British economist John Maynard Keynes reportedly quipped, “When the facts change, I change my mind. What do you do, sir?” I, too, changed my mind about the conclusions to be drawn from these experiments. Yes, LLMs do not know how the words they input and output relate to the real world. Yes, the post-training they receive may be fragile in that seemingly slight variations in the prompts can expose their ignorance. However, it is also true that the stochastic nature of their answers can create problems in testing and using LLMs.
As I noted when testing five LLMs on two questions (one statistical and one financial), an LLM might answer a question one way and then immediately contradict itself when asked the exact same question seconds later. It might also, as here, give several different but equally bad answers.
For many applications, these weaknesses can be fatal. For example, customer-service inquiries were initially thought to be low-hanging fruit for LLMs. However, a customer-service bot might answer identical customer queries in a variety of ways, some correct and some incorrect. Post-training might seem to stabilize the responses to specific queries, but the bot can still go off the rails when there are slight variations in how the customer phrases the query.
Among the many examples that have been reported:
- A DPD chatbot swore at a customer, called itself “useless,” and wrote a poem criticizing the company.
- A Cursor chatbot abruptly logged out customers and said this was “expected behavior” under a new company policy.
I recently had a frustrating interaction with a bot used by the Teachers Insurance and Annuity Association of America (TIAA). I simply wanted to know the status of a requested transfer of funds to eTrade. The “Contact Us” phone numbers led to a bot that went around and around in circles with identity-confirmation questions and unhelpful responses keyed to seemingly random words in my query. I gave up. Perhaps that is TIAA’s intention?
Some companies have, in fact, set up obstacle courses to rebuff customer-service requests. For example, in 2023, the federal Consumer Financial Protection Bureau fined Toyota’s auto-financing arm $60 million for various alleged misdeeds, including the creation of a dead-end hotline for canceling products and services. In 2025, Amazon agreed to pay $2.5 billion in penalties and refunds to settle a Federal Trade Commission lawsuit alleging that Amazon had used a variety of deceptive practices to “trick” shoppers into enrolling in auto-renewing Prime subscriptions and had then made it difficult to cancel those subscriptions.
For individuals and businesses that are honest and ethical and are seeking trustworthy medical, legal, financial, business, or life advice, for example, the inherent frailty and unreliability of LLMs is a flaw, not a feature. A simple way to remember this lesson is that it is treacherous to rely on something that can’t tell the difference between a possum’s snout and its tail.
