
How Often Do LLMs Hallucinate When Producing Medical Summaries?

A new paper from the University of Massachusetts Amherst and healthcare AI company Mendel found hallucinations in nearly all of the medical summaries generated by two leading large language models.

The University of Massachusetts Amherst and healthcare AI company Mendel released a paper this week exploring how often large language models tend to hallucinate when producing medical summaries.

Over the past year or two, healthcare providers have been increasingly leveraging LLMs to alleviate clinician burnout by generating medical summaries. However, the industry still has concerns about hallucinations, which occur when an AI model outputs information that is false or misleading.

For this study, the research team collected 100 medical summaries — 50 each from OpenAI’s GPT-4o and Meta’s Llama-3, a current proprietary model and a current open-source model, respectively. The team observed hallucinations in “almost all of the summaries,” Prathiksha Rumale, a UMass graduate student and one of the study’s authors, said in a statement sent to MedCity News.

In the 50 summaries produced by GPT-4o, the researchers identified 327 instances of medical event inconsistencies, 114 instances of incorrect reasoning and three instances of chronological inconsistencies. 

The 50 summaries generated by Llama-3 were shorter and less comprehensive than those produced by GPT-4o, Rumale noted. In these summaries, the research team found 271 instances of medical event inconsistencies, 53 instances of incorrect reasoning and one chronological inconsistency. 

“The most frequent hallucinations were related to symptoms, diagnosis and medicinal instructions, highlighting the fact that medical domain knowledge remains challenging to the state-of-the-art language models,” Rumale explained.

Tejas Naik, another of the study’s authors, noted that today’s LLMs can generate fluent and plausible sentences, even passing the Turing test.


While these AI models can speed up tedious language processing tasks like medical record summarization, the summaries they produce could be dangerous, especially when they are unfaithful to the source medical records, he pointed out.

“Assume a medical record mentions that a patient had a blocked nose and sore throat due to Covid-19, but a model hallucinates that the patient has a throat infection. This may cause the medical care professionals to prescribe wrong medicines and the patient to overlook the danger of infecting the elderly family members and individuals with underlying health conditions,” Naik explained.

Similarly, an LLM might overlook a drug allergy that is documented in a patient’s record — which may lead a doctor to prescribe a drug that could result in a severe allergic reaction, he added.

The research suggests that the healthcare industry needs a better framework for detecting and categorizing AI hallucinations. With such a framework in place, industry leaders could work together more effectively to improve the trustworthiness of AI in clinical contexts, the paper stated.

Photo: steved_np3, Getty Images

Editor’s note: A previous version of this story failed to mention that the study was published jointly by the University of Massachusetts Amherst and Mendel.