
Is AI really as good as advertised?

Claims that machines will surpass human smarts overlook one bedrock problem: A lot of real-world AI doesn’t work all that well.

After being falsely accused of a carjacking when she was eight months pregnant, Porcha Woodruff is suing the City of Detroit over what she says is an overreliance on facial recognition technology. Facial recognition is one of several AI-related technologies called out by researchers in a 2022 paper, "The Fallacy of AI Functionality." "Deployed AI systems often do not work," the researchers wrote. Carlos Osorio/Associated Press

Like a many-tentacled Kraken, new artificial intelligence is set to strangle human dominion on Earth — or so some experts now believe. The so-called godfather of AI, Geoffrey Hinton, quit his post at Google out of concern about where the technology is headed. AI, he thinks, will soon manipulate humans like an adult plucking a lollipop from a toddler. “Smart things can outsmart us,” he said at this year’s Emtech Digital event at MIT. “If these things get carried away with getting more control, we’re in trouble.” Data scientist Ken Jee put the anticipated AI-human showdown in even starker terms: “We must adapt to survive or else face extinction.”

But these high-flown what-ifs may run into a more banal reality: Many AI programs simply don’t work that well. The AI errors and hallucinations we see now are hardly flukes — for years, AI has had major functionality problems that eager adopters have overlooked.


Though you wouldn’t know it from all the recent hype, lots of AI ventures are moonshots that never quite land. AI body scan readers have failed to replace human radiologists, despite predictions of the specialty’s demise; and safe, dependable self-driving cars have so far proved elusive. In a 2023 Altair Engineering survey, business respondents reported that 36 percent of their AI projects flopped. In some cases, the stats are much worse. When a University of Cambridge team studied 62 AI models in 2021 that were programmed to diagnose COVID from chest X-rays or CT scans, they found that exactly none of the models was fit for clinical use due to various flaws and biases.

Whether an AI project dies on the research vine or sputters after it’s deployed (thanks to “launch first, fix later” credos), low-quality data input is often a prime culprit. AI systems learn by mainlining data, so if you want your AI system to, say, diagnose COVID from a chest scan, you’ll feed in as many previous scans as you can, specifying which scans came from people with COVID and which ones did not. Later, when the system reads a new scan, it will use its accumulated knowledge to judge whether or not COVID is present.
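The train-then-predict loop described above can be sketched in a few lines of Python. This is a toy, not a real diagnostic model: the feature vectors, values, and labels are all invented, and the learner is a nearest-centroid classifier, one of the simplest possible supervised methods, standing in for the far more complex models real systems use.

```python
# Toy sketch of supervised learning: feed in labeled examples,
# then label a new one by comparison. All data here is invented.

def train(scans, labels):
    """Average the feature vectors for each label (covid / clear)."""
    centroids, counts = {}, {}
    for features, label in zip(scans, labels):
        acc = centroids.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    for label, acc in centroids.items():
        centroids[label] = [x / counts[label] for x in acc]
    return centroids

def predict(centroids, features):
    """Label a new scan by whichever class centroid it sits closest to."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Invented training data; features per scan: [opacity score, lesion count]
training_scans = [[0.9, 3], [0.8, 4], [0.1, 0], [0.2, 1]]
training_labels = ["covid", "covid", "clear", "clear"]

model = train(training_scans, training_labels)
print(predict(model, [0.85, 3]))  # resembles the "covid" examples
```

The quality of the prediction depends entirely on the labeled scans fed in during training, which is why bad input data is such a common point of failure.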


But in practice, says Brown University AI researcher Lizzie Kumar, bad AI training data has long been rampant, leading to bad system performance. Some of the COVID diagnosis programs in the University of Cambridge study were learning from biased data, such as unusual chest scans that did not represent general population scan results — a misstep that set them up for failure.
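The biased-data failure mode can be made concrete with a small sketch. In this invented example, every "covid" training scan happens to carry a strong scanner artifact, so a naive learner latches onto the artifact instead of the disease, then misses a genuine case from a hospital whose scanner lacks it. The features, values, and selection rule are all made up for illustration.

```python
# Toy illustration of biased training data: the model learns a
# spurious scanner artifact, not the disease. All values invented.

def best_feature(scans, labels):
    """Pick the feature with the widest training-set gap between the
    classes -- a crude stand-in for what a real optimizer rewards."""
    best, best_gap = 0, float("-inf")
    for f in range(len(scans[0])):
        covid = [s[f] for s, l in zip(scans, labels) if l == "covid"]
        clear = [s[f] for s, l in zip(scans, labels) if l == "clear"]
        gap = min(covid) - max(clear)
        if gap > best_gap:
            best, best_gap = f, gap
    return best

# features per scan: [lung opacity, scanner-artifact strength]
scans = [[0.9, 1.0], [0.8, 1.0], [0.2, 0.0], [0.3, 0.0]]
labels = ["covid", "covid", "clear", "clear"]

f = best_feature(scans, labels)  # picks feature 1: the artifact
covid_vals = [s[f] for s, l in zip(scans, labels) if l == "covid"]
clear_vals = [s[f] for s, l in zip(scans, labels) if l == "clear"]
cutoff = (min(covid_vals) + max(clear_vals)) / 2

# A genuine covid scan from a hospital whose scanner has no artifact:
new_scan = [0.85, 0.0]
print("covid" if new_scan[f] > cutoff else "clear")  # prints "clear": a miss
```

On its own training data the model looks perfect, which is exactly how such flaws slip past developers and into deployment.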

Likewise, some AI tools use faulty criminal justice data — riddled with biases and reporting errors — in trying to predict things like whether a given criminal will reoffend, yielding skewed results that can lead judges to impose overly harsh sentences. “It’s hard to make a philosophically sound argument that you should be using that data,” says Kumar, who is coauthor of a 2022 paper called “The Fallacy of AI Functionality.” “Why would you ever trust that data as ground truth?”

While press coverage has focused on notorious cases in which, for instance, predictive AI treats Black defendants differently from white ones, these systems can also fail much more broadly. The accuracy of COMPAS, a popular recidivism risk calculator, hovered just above 60 percent in one study regardless of defendants’ race. And yet many US courts have relied on it for years.


It’s tempting to assume such shortcomings can be resolved with more thorough and accurate training data, but AI has shown other deficits that even the best data can’t remedy. Generative AI systems like ChatGPT make a flurry of opaque calculations to devise novel replies to questions. That process veers off the rails when chatbots make up facts, a surprisingly difficult glitch to correct. Because programs like ChatGPT make such complex connections between data points to generate answers, humans often cannot understand their reasoning, a longstanding AI issue known as the “black box problem.” Since we can’t see the box’s inner workings, fixing them is akin to wielding a wrench blindfolded. Of course, that hasn’t stopped companies from forging ahead with generative AI tools and sweet-talking us into trusting them.

To protect us from our own credulity, critics are urging government regulators to crack down on companies whose AI products don’t work reliably. The Federal Trade Commission is “already signaling that they’re on board with this principle that you should be protected from unsafe or ineffective systems,” Kumar says. But grassroots efforts could prove just as crucial. Landlords get to decide whether they’ll use AI-based tenant screening aids that claim fairness but discriminate against many applicants. Health insurance companies get to decide whether they’ll reimburse providers for offering patients unproven AI mental health apps. And editors get to decide whether they’ll trust AI composition tools that churn out dubious facts for the masses.


Will AI someday reason circles around us, as Hinton warns? For now, that almost seems beside the point. Futurists will continue generating splashy headlines, but for the rest of us, the next best move is both simpler and more profound: Evaluate each AI system individually — with the same scrutiny we’d bring to a telemarketing pitch or a “One Weird Trick” email — and act accordingly.

Elizabeth Svoboda, a writer in San Jose, Calif., is the author of “What Makes a Hero?: The Surprising Science of Selflessness.”