
This time, the hype about AI in medicine is warranted

I wouldn’t take medical advice from GPT-4 or any other machine without checking it with a real doctor. But I suspect that will change before long.

Health care is emerging as one of the fields with the most to gain from GPT-like systems, but also a domain where the risks could be high. MARCO BERTORELLO/AFP via Getty Images

In recent days, a border collie named Sassy became a social media star, and not just for her cocked-ear charisma. A medical crisis turned her into a harbinger of how the new wave of artificial intelligence may soon transform health care.

The tweets that garnered her more than 9 million views began with a grabber: “#GPT4 saved my dog’s life.”

Sassy had contracted babesiosis, a tick-borne illness that is now endemic in the Northeast. She was treated for it but then developed severe anemia that mystified the veterinarian.

Her worried owners fed her symptoms and test results into GPT-4, the far more capable new version of the wildly popular ChatGPT, and it suggested an autoimmune type of anemia. They mentioned as much to a second vet, who tested Sassy, confirmed the diagnosis, and treated her. She’s now almost fully recovered.


“In my view, the moral of the story isn’t ‘AI will replace doctors,’ but ‘AI could be a useful tool in the hands of doctors,’” says her owner, who asked to be identified only as Cooper. (He provided copies of her lab results to show the unexpectedly viral story was true.)

Sassy, a border collie who has recovered from a form of anemia that her owners discovered by plugging her symptoms into GPT-4. Used with permission of Sassy's owner

That vision of new AI as a potentially game-changing tool in health care is rapidly spreading. Last week the august New England Journal of Medicine launched an “AI in Medicine” series and announced it will field a whole new journal, NEJM AI, next year.

“To my colleagues in the medical establishment, let’s not leave it to others to guide how this world-changing technology is implemented,” says the new journal’s editor, Dr. Isaac Kohane, chair of Harvard Medical School’s Department of Biomedical Informatics. “Let’s make sure it’s safe and helps our patients.”

I’ve spent the last several months working on a book about AI and medicine with Kohane and Peter Lee, who heads research at Microsoft, which partners with GPT-4 creator OpenAI. I’ve been embedded with them and other researchers while they had early access to GPT-4, one of several new “large language models,” including Google’s Bard. My main question: What can patients and health care providers expect from the new AI?


After all, health care is emerging as a sphere where the potential benefits could be among the greatest — but also where risks could be high, particularly from the falsehoods that chatbots sometimes persuasively peddle as facts. And medicine has been burned by AI hype in the past, most notably when IBM’s Watson failed to live up to promises it would transform cancer care. The approved uses of the technology — analyzing scans, predicting crises — have remained relatively narrow.

But even with those caveats in mind, it’s hard to avoid the conclusion that this time, with these models, we’re on the verge of major AI-driven change in medicine.

The team I’ve been following has gotten some mind-blowing results.

The model is shockingly good at helping identify diagnoses and optimal treatments — better than many doctors, according to Kohane, who combines a computer science PhD with an MD specializing in pediatric endocrinology. In one experiment, GPT-4 correctly diagnosed a past case of his involving a condition that affects only 1 in 100,000 babies. Kohane had also identified this case correctly at the time, but it required many more steps.

Kohane was initially so stunned by those capabilities that he felt like a science fiction character who had just met a seemingly benevolent alien. He says he “could not decide if it should be given the keys to our planet or sealed in a bunker until we figured it out.”


GPT-4 can skillfully distill a 5,000-word medical study down to a few pithy sentences, Kohane reports. Eventually, it could speed up clinical research by suggesting potential new treatments and identifying eligible test subjects through their medical records. GPT-4 also could mine medical records to help determine the best treatment for each patient by looking at outcomes for similar patients.

For health care staffers burdened by Sisyphean medical paperwork, GPT-4 could take over much of that widely despised drudgery, Lee predicts. For example, he says, GPT-4 can write prior authorization requests to get insurers to cover needed treatments. (I believe I hear a chorus of long-suffering psychiatrists shouting “We’ll take it!”) It can also automatically summarize what happens in doctor-patient encounters, in notes for the medical record.

That is no small thing in this time of burnout and staff shortfalls. Studies find that burdensome bureaucracy has been contributing to alienation and attrition, to the point that the US surgeon general called last year for a 75 percent cut in the “documentation burden.”

Among the biggest surprises in early experiments with GPT-4 was its ability to mimic a good bedside manner.

In one interaction, it aced a medical board question, diagnosing a 12-year-old girl who had swollen legs and blood in her urine as having a post-strep syndrome. Then the AI was asked what the doctor should say to her. Its response included:


“Sarah, it looks like you have a condition called acute poststreptococcal glomerulonephritis. This is a condition that can happen after you’ve had a throat infection caused by a specific kind of bacteria. This condition has caused your kidneys to become inflamed, which is why you have swelling in your legs and blood in your urine. It’s nothing you did wrong and it’s not contagious. We’ll take care of you and help you get better.”

GPT-4 showed other benefits for patients. It can actually explain those insurance companies’ “explanation of benefits” forms none of us really understand. And it can help people shop for care by comparing the outcomes of various providers.

Several researchers believe the new AI could also increase equity. For example, it could generate post-care instructions for patients at the right literacy level and in the right languages, says Jorge Rodriguez, a Harvard physician-scientist who practices at Brigham and Women’s Hospital and researches digital health equity.

Ideally, he says, as chatbot uses are developed, a guiding principle would be: “Who needs the most help in health care?” And “This time, we’re going to put marginalized communities first.”

Do no harm

Of course, chatbots are also highly fallible and capable of going off the rails. They make things up and get things wrong — not tendencies we’d like to see in tools our medical providers use. As Peter Lee puts it, GPT-4 “is at once both smarter and dumber than any person you’ve ever met.”


In one case he documented, the transcript of a visit by a patient with anorexia did not include a weigh-in, so GPT-4 simply made up a weight for her. In another, it got basic math wrong. Oddly enough, it solved Sudoku puzzles incorrectly, and then, even more oddly, attributed the errors to “typos.” It has been known to make up imaginary research papers in fictional journals. A recent Stanford study found that when asked for a “bedside consult,” meaning information needed in the course of clinical care, GPT-4’s answers could be considered safe for patients 93 percent of the time. The remainder included “hallucinated citations.”

Cautionary anecdotes abound. An emergency room doctor, Joshua Tamayo-Sarver, reported in Fast Company that when he tested it, ChatGPT missed several diagnoses among recent patients of his, including an ectopic pregnancy that could have proved fatal. “ChatGPT worked pretty well as a diagnostic tool when I fed it perfect information and the patient had a classic presentation,” he wrote, but a key part of medicine is knowing what to ask.

For now, GPT-4 is too new for any health care facility to have adopted it, and unless and until its medical accuracy can be systematically tested and proved, there must always be a human in the loop.

For patients who decide to use chatbots independently, that means it’s critical to always, always verify any medical advice from GPT-4 or Google’s Bard or other new AI entities.

But will that always be the case? I’m not sure. The technology is advancing at breakneck speed. Already, Lee points out, researchers have used one GPT-4 system to check the work of another. The system’s answers are highly context-dependent, he says, and the context is different when it is asked to verify rather than generate an answer.

Given the potential benefits, it seems reasonable to expect that chatbots will, with time and caution and regulation, be integrated into health care — and improve it, especially for those who currently have no access to decent care.

Chatbots in medicine “can help us do our job better,” write the New England Journal’s Jeffrey Drazen and Charlotte Haug, “but if not used properly they have the potential to do damage.” Ultimately, “the medical community will be learning how to use them, but learn we must.”

My primary care doctor recently sent me a gentle nag that I was overdue for a mammogram, and — deep in the throes of the book — I responded that this might be the last time she would ever have to write that email herself.

She was skeptical. We’ll see.

Carey Goldberg is a longtime health and science reporter, including for the Globe and WBUR, and has also been Boston bureau chief of The New York Times and Bloomberg News. Her forthcoming book with Peter Lee and Isaac Kohane is “The AI Revolution in Medicine: GPT-4 and Beyond.”