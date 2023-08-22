Artificial intelligence is nearly as good as a recent medical school graduate at making clinical decisions, but struggles in key areas that show it won’t be replacing the doctor anytime soon, according to a new study by Mass General Brigham.

The study, published Tuesday in the Journal of Medical Internet Research, found that ChatGPT was about 72 percent accurate in overall clinical decision making for patients, from arriving at a final diagnosis to coming up with treatment plans. While there are no formal benchmarks, researchers estimate that such passing performance is on par with a new doctor, known as an intern or resident.

However, the program struggled to come up with an accurate list of initial diagnoses based on early information about a patient, critical work that informs which tests should be run and is the bedrock of a physician’s practice.

“The big takeaway is the chatbot performs well in some scenarios but not all. And you have to be cautious in applying it,” said corresponding author Dr. Marc Succi, associate chair of innovation and commercialization at Mass General Brigham Enterprise Radiology and strategic innovation leader at Mass General Brigham.

The study comes as generative artificial intelligence is spreading into multiple sectors, including higher education, the corporate world, and everyday life.

Forms of AI have been in use by health systems for years, but hospitals are slowly starting to adopt more powerful forms of the technology, including “large language model” AI, which uses powerful analytics to convincingly generate human-like text. Versions of the technology, which power platforms such as ChatGPT, GPT-4, and Google’s Bard, have produced astonishing results, as they have been trained on enormous data sets and are run by more powerful computers.

While some clinicians have begun experimenting with the technology, health systems for the most part have limited the use to administrative work in hospitals, as they study and test the technology on the clinical side.

This study adds further insight to that adoption. John Brownstein, chief innovation officer at Boston Children’s Hospital, who was not involved in the study, said the results show the technology is promising, but that it is not adept enough to be used in isolation, and requires important guardrails.

“It highlights the fact that while these tools can be useful, they aren’t a silver bullet,” he said.

In the study, Mass General Brigham researchers asked ChatGPT to look at 36 clinical vignettes. The bot was asked to come up with a set of possible diagnoses based on the initial information provided by the patient — including age, gender, symptoms, and whether the case was an emergency. The program was given additional pieces of information, and asked to make decisions on possible tests. Eventually the program was asked to give the final diagnosis and treatment plan.

The team compared the program’s accuracy at each stage of the diagnostic and treatment process, finding it made the correct final diagnosis 77 percent of the time. Initially, however, its list of possible diagnoses was accurate only 60 percent of the time. It was also 68 percent accurate in the medical management of a patient, such as figuring out what medications to prescribe.

Notably, researchers said ChatGPT didn’t show gender bias and that its results were equal in both primary care and emergency settings.

While the results were promising, researchers said more research is needed to assess the foundational accuracy of such bots.

“All these tools physicians and health care providers use, like a stethoscope, are augmenting tools to make us more efficient. This is another augmenting tool,” Succi said. “AI ultimately augments the health care provider, but it doesn’t replace the provider.”

