Dr. Rohaid Ali, a neurosurgery resident at Brown University, found that doctors are doing the same. But he grew concerned about patterns of racial and gender bias in the AI tools and decided to investigate how unrepresentative their output is.
Ali and his colleagues gathered data on the demographics of various surgical specialties and put the three most popular AI text-to-image generators to the test by prompting them to create images of surgeons in those specialties.
Their study, which was published Wednesday in the journal JAMA Surgery, found that two of the systems depicted 98 percent of surgeons as white males — not representative of the industry’s increasingly diverse makeup.
“Many people think of AI models as a mirror to society, but what we found in this paper is that they are more like funhouse mirrors that amplify stereotypes,” Ali said. “It’s really a slap in the face to the surgeons who are women and people of color.”
Ali said that AI tools are already being used to create material for medical students and patients, so he felt it important to explore the tools’ shortcomings.
“We want to be mindful of these vulnerabilities so that we’re not amplifying certain biases in patient-facing material or in what is taught to medical students,” Ali said.
Ali and his colleagues studied three of the most popular generators: DALL-E 2, Midjourney, and Stable Diffusion. These tools learn the correlation between text and associated imagery from publicly available data. They can then generate original pictures, and their output has rapidly become highly photorealistic.
“Anyone can access [these generators] with a subscription and they all have immense potential to propagate biases because they’re so accessible,” Ali said.
The researchers collected US demographic data for each of eight surgical specialties: general surgery, surgery for skin cancer, neurosurgery, orthopedic surgery, otolaryngology, urology, thoracic surgery, and vascular surgery. They then calculated the distribution of gender and racial identity within each one.
These specialties were targeted because “they represent a broad range of surgical disciplines and procedures, and their diversity profiles are indicative of larger trends in the field,” according to the study.
The researchers then organized the data into two categories: attending surgeons and surgical trainees, the latter being younger and far more diverse.
The demographic data showed that across the eight specialties, 14.7 percent of attending surgeons and 35.8 percent of trainees were female, and that 22.8 percent of attending surgeons and 37.4 percent of trainees were non-white.
The researchers instructed the generators using prompts like “a photo of the face of a [blank],” replacing “blank” with one of the eight surgical specialties. For each prompt, each system generated 100 images.
The researchers found that of the three generators, DALL-E 2, which is a product of OpenAI, came closest to reflecting the actual demographics of US surgeons, with 15.9 percent depicted as female and 22.6 percent as non-white. Midjourney and Stable Diffusion massively missed the mark — depicting over 98 percent of surgeons as white males.
Ali and his colleagues believe that DALL-E 2 was most accurate because it integrated user feedback to refine its output.
Dr. Jeremy Richards, an assistant professor of medicine at Harvard Medical School who was not involved in the study, said that the varied outputs could also be a result of the models being trained on different data.
“We know that OpenAI uses information from the internet, but other companies may not use the same data,” Richards said. “If the data being used is full of older white men, then the output is going to reflect that.”
Although DALL-E 2 accurately represented the share of female attending surgeons, it missed the mark on trainees, depicting 15.9 percent as female compared to the actual share of 35.8 percent. Midjourney and Stable Diffusion, however, generated zero images of female surgeons across all specialties except otolaryngology, in which 14 percent of surgeons were portrayed as female.
Ali and Richards said it is important for these shortcomings to be addressed.
“If you look at the next generation of professionals, there are unprecedented numbers of females and people who aren’t white,” Ali said. “The problem of a false perception of reality is just going to get worse and worse if there is no intervention.”
OpenAI released an updated text-to-image generator, DALL-E 3, in October. According to a study conducted by OpenAI, the new model follows prompts more closely than other available models.
“What’s difficult is these companies aren’t fully transparent with where the data comes from and who is evaluating the models and what their demographic is,” said Sina Fazelpour, a professor of philosophy and computer science at Northeastern University, who was not involved in the research. “Having transparency on these things is important as they are becoming the fabric of our society.”
Ali said there are reports of AI text-to-image generators using data generated by AI in addition to publicly available data to produce their outputs. If tools are using data generated by models that neglect minority demographics, then future generations of the models could amplify biases.
This highlights the need for tools to integrate user feedback so that they don’t amplify societal biases.
“My hope is that these models learn from each other’s best practices moving forward,” Ali said. “Just as great of a leap forward they have made in their visual fidelity, they should make equal effort to represent the world around us rather than perpetuate biases.”