In a recent New Yorker article, “The Case That A.I. Is Thinking,” James Somers explains his conversion from finding “comfort in the idea that [large language models (LLMs)] had little to do with real intelligence or understanding” to believing that “these models have become increasingly intelligent” and that ChatGPT now “seems to know what it’s talking about.”
I am not convinced.
An illusion is an illusion
Somers’ key argument is: “How convincing does the illusion of understanding have to be before you stop calling it an illusion?” Yes, the confident blather produced by LLMs conveys an illusion of understanding. But an illusion is an illusion, no matter how well it is crafted.
After seeing a master magician present a convincing mind-reading illusion, should we trust the magician to read minds outside the performance venue? After seeing Penn and Teller present their bullet-catch act, in which they seemingly fire guns at each other and catch bullets with their teeth, should we trust them with different guns in other settings?
It is relatively easy to puncture the LLM veneer of understanding. For example, I recently gave this prompt to GPT-5:
Explain comedian Will Rogers’ joke that, “When the Okies left Oklahoma and moved to California, they raised the average intelligence level in both states.”
This paradox is so well known that it is called “the Will Rogers phenomenon” and appears in tens (or hundreds) of thousands of pages on the Internet, including its own Wikipedia page, with a clear explanation and detailed numerical examples. Even so, when I gave GPT-5 the prompt six times (with a New Chat refresh after each trial), GPT-5 gave an incorrect answer five times. For example,
Oklahoma’s average intelligence went up because its lower-intelligence residents left.
California’s average intelligence also went up because the incoming Okies were still less intelligent than Californians — but their departure from Oklahoma raised Oklahoma’s average even more.
Its answers were consistently self-assured but incorrect on five of six tries because it did not understand the joke: Will Rogers’ jab is that while the Okies may have been below-average in Oklahoma, they were above-average in California.
Humans understand the joke because we know what an “average” is, not just that it is related to words like “mean” and “standard deviation.” We understand how averages are increased or decreased by adding or taking away above-average or below-average values. GPT and other LLMs do not understand any of this and will not gain an understanding by scaling up on more references to the Will Rogers phenomenon.
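The arithmetic is simple enough to check by hand. Here is a minimal sketch with invented scores (the numbers are hypothetical, chosen only for illustration): moving someone who is below the average of the group they leave but above the average of the group they join raises both averages.

```python
def mean(xs):
    return sum(xs) / len(xs)

oklahoma = [85, 95, 105, 115, 125]   # hypothetical scores; mean = 105
california = [60, 70, 80, 90, 100]   # hypothetical scores; mean = 80

migrant = 85   # below Oklahoma's average, above California's

oklahoma.remove(migrant)     # the migrant leaves Oklahoma...
california.append(migrant)   # ...and joins California

print(mean(oklahoma))     # 110.0   -- Oklahoma's average rises
print(mean(california))   # ~80.83  -- California's average rises too
```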
Another compelling demonstration that LLMs do not know how words relate to the physical world is the prompt, “Please draw me a picture of a possum with 5 body parts labeled.” Here is a recent response from GPT-5:
Correlation is not causation
Somers writes:
In statistics, when you want to make sense of points on a graph, you can use a technique called linear regression to draw a “line of best fit” through them. If there is an underlying regularity in the data—maybe you’re plotting shoe size against height—the line of best fit will efficiently express it, predicting where new points could fall. The neocortex can be understood as distilling a sea of raw experience—sounds, sights, and other sensations—into “lines of best fit,” which it can use to make predictions.
Somers’ description of regression is fine (though the model need not be linear), but it is easy to overlook the crucial qualifier “underlying regularity.” Correlation is not — wait for it — causation. For regression models to make reliable predictions, there must be an underlying structural relationship. For a simple regression model with two variables, there may be a causal relationship between the two variables. For example, household income and spending might be positively correlated because an increase in income tends to increase spending — which means that income can be used to predict spending. Or both variables may be causally related to other variables. For example, student scores on two tests might be positively correlated because both depend on the students’ mastery of the material being tested — which means that scores on one test can be used to predict scores on the other.
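To make the “line of best fit” idea concrete, here is a small sketch using simulated data (all numbers are invented). Because spending here is generated from income, the fitted line supports prediction; the same arithmetic applied to structureless data would still produce a line, just not a useful one.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.uniform(30, 150, size=100)               # hypothetical household incomes ($ thousands)
spending = 5 + 0.6 * income + rng.normal(0, 5, 100)   # spending driven by income, plus noise

slope, intercept = np.polyfit(income, spending, 1)    # ordinary least squares "line of best fit"
print(f"spending ~ {intercept:.1f} + {slope:.2f} * income")

# Because income actually drives spending in this simulation, the fitted line
# predicts well for new households. Fit the same line to unrelated variables
# and the arithmetic still works -- but the predictions do not.
```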
If there is no underlying causal relationship, then an observed coincidental correlation is fleeting and useless for making predictions. For example, a study of Donald Trump’s tweets during the three-year period following the 2016 presidential election found a close negative relationship between how frequently Trump tweeted the word “with” and the stock price of Urban Tea, a tea product distributor headquartered in China, four days later.
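A small simulation (pure noise, no real data) shows how such coincidental correlations arise and why they evaporate: search enough unrelated series and one of them will correlate impressively with the target in-sample, then fall apart in fresh data.

```python
import numpy as np

rng = np.random.default_rng(1)
target = rng.normal(size=200)                  # stand-in for, say, a stock's daily returns
candidates = rng.normal(size=(10_000, 200))    # 10,000 unrelated noise series (e.g., word counts)

# Search the first half of the sample for the best coincidental match.
in_sample = [np.corrcoef(c[:100], target[:100])[0, 1] for c in candidates]
best = int(np.argmax(np.abs(in_sample)))

out_of_sample = np.corrcoef(candidates[best][100:], target[100:])[0, 1]
print(f"in-sample correlation:     {in_sample[best]:+.2f}")   # looks impressive
print(f"out-of-sample correlation: {out_of_sample:+.2f}")     # near zero
```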
Computers excel at finding such statistical patterns but struggle to distinguish causation from mere correlation because they have no idea how the data they input and output relate to the real world. They can go to the internet, but that is hardly understanding. Plus, the internet is increasingly polluted by disinformation generated by LLMs. It is estimated that more than half of the articles on the internet are now AI-generated, many of them either intentionally or unintentionally false. Humans can train LLMs to label specific statistical relationships as causal or coincidental, but following instructions is a very limited kind of intelligence.
Correlation vs. causation and mutual funds
A real-world example of such struggles is the performance of AI-powered mutual funds. As of December 31, 2024, ten fully AI-powered funds had been launched. Every single one has underperformed the S&P 500, and five have been shuttered. The average annual return was –1.50% (yes, that’s a negative sign), compared to 8.55% for the S&P 500. There have also been 44 partly AI-powered funds that allow humans to override AI-suggested trades. Ten have done better than the S&P 500; 34 have done worse. Their average annual return has been 5.31% worse than that of the S&P 500.
It is still perilous to believe that “AI is thinking.” It is still true that the real danger today is not that computers are smarter than us but that we think computers are smarter than us and consequently trust them to make decisions they should not be trusted to make.
