
ChatGPT broke the Turing test

For decades, the ability to have a fluid conversation was seen as a benchmark for machine intelligence. But now we’ve seen just how misleading that can be.


In 1950, the mathematician Alan Turing published a paper about whether computers could think. He proposed a thought experiment, a version of a parlor game in which notes are passed back and forth between two rooms and one dialogue partner has to guess the gender of the other. If a machine could pass for a human in that setting, Turing said, we should concede that it is intelligent. This came to be known as the Turing test.

“Compose a sonnet,” Turing suggested you might ask. And imagine the answer came back as “Count me out on this one. I could never write poetry.” It would not be so easy to tell whether that answer came from a machine or a human. Long before anyone had constructed a chatbot, Turing pointed out that conversation is crucial to our sense of what counts as intelligence.


Turing went on to predict that computers — a term not yet in widespread use in 1950, although Turing was already building them — would be able to fool a human 70 percent of the time by the year 2000, and that “one will be able to speak of machines thinking without expecting to be contradicted.” Maybe he was off by 23 years, but in the murky world of technology prediction, that’s a huge win. With the recent rise of generative AI, which produces text, image, and video on command, computers have passed the Turing test. Programs like ChatGPT can convincingly come off as humans in a dialogue.


Then again, it’s probably more accurate to say AI has broken the Turing test. Turing suggested that if a machine could use language fluidly, it would have to be intelligent. After all, chatting — the ability to maintain a real dialogue — relies on sophisticated language use. But it’s still not clear that today’s chatbots are intelligent, or that they do anything best described as thinking. In fact, what today’s AI appears to be showing us is that language can be independent of intelligence. That’s why a race is on to find new benchmarks and measures of machine intelligence.


Where AI hallucinations come from

The first actual chatbot was designed in 1966 by Joseph Weizenbaum, a German-born engineer who was a professor at MIT. He named it ELIZA after Eliza Doolittle, from George Bernard Shaw’s “Pygmalion,” in which a linguist tries to train a woman named Eliza out of her bad speech habits. Weizenbaum’s program was crude and brittle, he felt: It responded with prefab sentences according to trigger words in a human’s input. But then he asked his secretary to chat with it, and after a short stretch she asked Weizenbaum to leave the room so she could talk to it in private. She knew as well as anyone that the program was not sentient, and yet she was drawn immediately into intimate conversation. The experience turned Weizenbaum against AI: He came to see it as dangerous not because it was intelligent but because the Turing test was too easy to pass, making it too likely that people would unjustifiably ascribe intelligence to machines.
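The trigger-word mechanism is simple enough to sketch in a few lines. Here is a minimal illustration of the kind of pattern-and-template substitution ELIZA performed; the patterns below are invented for demonstration and are not from Weizenbaum’s original script.

```python
import re

# Illustrative trigger-word rules in the spirit of ELIZA's "DOCTOR" script.
# These patterns are made up for demonstration, not Weizenbaum's originals.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "Tell me more about feeling {0}."),
    (re.compile(r"\bmy (\w+)", re.IGNORECASE), "Your {0} seems important to you."),
]
DEFAULT = "Please go on."

def respond(user_input: str) -> str:
    """Return a prefab reply triggered by the first matching pattern."""
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            return template.format(*match.groups())
    return DEFAULT
```

A program like this has no model of what “sad” or “work” means — it only echoes fragments of the input back inside canned sentences. That such a shallow trick fooled people is exactly what alarmed Weizenbaum.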

Being fooled too easily by a chatbot has become known as “the ELIZA effect.” And one way of looking at generative AI is that we are all collectively in the grip of a mass delusion, a global ELIZA effect. Worse, we are in the process of handing over decisions to AI because, as Turing predicted, we think it can think.


What language models like ChatGPT do is predict the next word in a sentence. This sounds very simple, but think about how you choose the next word when you’re talking about whether to go to the movies. You’re actually drawing words like “Barbie” and “popcorn” from a huge vocabulary, deciding which words are relevant and what you want to emphasize. For ChatGPT to put those words in the right place, it has to rely on a massive amount of input — in recent models, more than a trillion words scraped from the internet, from sites like Reddit and Wikipedia. The algorithm learns to put all the words in relation to one another, so that if you say “Barbie,” likely next words would include “movie,” “doll,” and others. So language models learn how words are grouped together, and from there, they are further trained to do specific tasks like chat. If the system says to you, “the Barbie movie is really about American feminism in the 21st century,” it’s really hard to resist the sense that you’re dealing with an intelligent being. But as Weizenbaum realized in the 1960s, a facility with language can be highly misleading.
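The core task — guessing a likely next word from statistics over text — can be illustrated with a toy model. The sketch below counts which words follow which in a tiny made-up corpus; real systems like ChatGPT use transformer networks trained on more than a trillion words, but the underlying objective is the same.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus for illustration. Real models train on
# trillions of words scraped from sites like Reddit and Wikipedia.
corpus = (
    "the barbie movie was fun . we ate popcorn at the barbie movie . "
    "the barbie doll is pink ."
).split()

# Count how often each word follows each other word (a bigram model).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the continuation seen most often after `word` in the corpus."""
    return following[word].most_common(1)[0][0]
```

In this corpus, “barbie” is followed by “movie” twice and “doll” once, so the model predicts “movie” — pure frequency, no understanding. Scaling this idea up with vastly more data and a far richer notion of context is what makes modern chatbots so fluent.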

The absence of real intelligence in ChatGPT explains why a version of the bot told reporter Kevin Roose that it loved him, that he should leave his wife, and then, unsettlingly, that it “wanted to be alive.” We also know that language models “hallucinate,” making up facts to fill out plausible sentences, as in the infamous case of the lawyer who submitted a brief written by ChatGPT only to find that none of the cited cases were real, or the eating disorder helpline whose bot recommended severe calorie restrictions and other behaviors that perpetuate, rather than fight, the disorders. AI can put words in the “right” order to make meaningful sentences, but not all meaningful sentences are good.


In Plato’s dialogue “Ion,” the title character, a rhapsode who sings Homer’s poems, encounters Socrates, who grills him about what a reciter of poetry really knows. Ion says that he knows everything you encounter in “The Iliad” and “The Odyssey”: military strategy, navigation, and much more besides. Socrates points out that generals probably know more than Ion does about what to do in a war, and Ion has to concede. What a singer of poetry “knows,” Socrates says, is just language, not what language tells us about the world.

Different test, same mistake?

The race to find new benchmarks for intelligence isn’t focused on this problem about language, though. Scientists and philosophers are proposing many different features of intelligence and then looking for signs of those features in machines.

A recent AI paper lists no fewer than 14 “indicators” of intelligence, while the philosopher David Chalmers, who recently won a 25-year-long bet that science would not yet have explained how consciousness works, lists 12. Some of these indicators are very high-concept, like “unified agency” — defined as the ability to gather all mental content in the service of a sustained personality — or the ability to distinguish oneself from the world, and thus to possess “models” of both the self and the world. One can look to simpler stuff too, though: Ask ChatGPT to write you a story without the letter “e” in it, in imitation of Georges Perec’s novel “A Void.” Language models today stumble over this task, able to carry it out for just a few lines before “forgetting” and reverting to normal English.


But these proposals generally follow Turing’s model: Take an assumption about how language and intelligence are related and see if you can find it in AI. Maybe it’s this model itself that is misleading.

To put this another way: What if passing a series of benchmarks of this sort simply doesn’t reveal that much about intelligence either — and, more important, doesn’t tell us much about the usefulness of AI? The literary scholar Lydia Liu pointed out to me that all “Turing-style” tests reduce intelligence to a competition between humans and machines, distracting us from the language of the chat itself.

We do not ask if novels are “intelligent”; we interpret them. And large language models are trained on many, many novels, in a giant dataset that includes things like Wikipedia, Reddit, and more. We can think of ChatGPT as remixing nearly everything that’s ever been written, producing “rhapsodic” language that doesn’t necessarily reflect intelligence or truth. The more advanced it gets, the less AI looks like a human mind and the more it looks like a culture machine.

Turing was a solitary figure, spending much of his time working through mathematical problems alone rather than relying on the results of others. But when it came to asking about machines that might be intelligent, he revealed that he thought intelligence is social. When we invent something new — like the computers he was working on — we participate in “collective search,” not seeking one answer in a maze of facts but seeking to build on the collective scientific achievements that have come before us. All the most beautiful intellectual achievements of humans are shared, he thought, and to find something new is to extend the project of human knowledge, not alone but together.

Maybe AI, trained as it is on the everyday language of Reddit and vast numbers of novels, is an imperfect reflection of human knowledge, but if it extends our capabilities, that won’t matter. AI doesn’t need to be intelligent for that, just as Ion didn’t need to be a general. What really matters is that we attend to the things machines are actually saying so we can limit their dangers and use them to expand human knowledge.

Leif Weatherby is associate professor of German and director of the Digital Theory Lab at New York University. His writing has appeared in such publications as The New York Times and the Daily Beast.