As the burden of documentation and other administrative duties has grown, physician burnout has reached historic levels. In response, EHR vendors are embedding generative AI tools that help physicians by drafting responses to patient messages. However, much remains unknown about these tools' accuracy and effectiveness.
Researchers at Mass General Brigham recently set out to learn more about how these generative AI tools are performing. Their study, published last week in The Lancet Digital Health, showed that the tools can be effective at reducing physicians' workloads and improving patient education, but also that they have limitations requiring human oversight.
For the study, the researchers used OpenAI's GPT-4 large language model to generate 100 hypothetical questions from patients with cancer.
The researchers had GPT-4 answer these questions, and they also had six radiation oncologists respond to them manually. The research team then gave those same six physicians the GPT-4-generated responses to review and edit.
The oncologists could not tell whether GPT-4 or a human physician had written a given response; in nearly a third of cases, they believed a GPT-4-generated response had been written by a physician.
The study showed that physicians usually wrote shorter responses than GPT-4. The large language model's responses ran longer because they typically included more educational information for patients, but they were also less direct and instructional, the researchers noted.
Overall, the physicians reported that using a large language model to help draft responses to patient messages reduced their workload and associated burnout. They deemed GPT-4-generated responses safe in 82% of cases and acceptable to send without further editing in 58% of cases.
But it’s important to remember that large language models can be dangerous without a human in the loop. The study also found that 7% of GPT-4-produced responses could pose a risk to the patient if left unedited. Most of the time, this was because the GPT-4-generated response contained an “inaccurate conveyance of the urgency with which the patient should come into clinic or be seen by a doctor,” said Dr. Danielle Bitterman, an author of the study and a radiation oncologist at Mass General Brigham.
“These models go through a reinforcement learning process where they are kind of trained to be polite and give responses in a way that a person might want to hear. I think occasionally, they almost become too polite, where they don’t appropriately convey urgency when it is there,” she explained in an interview.
Moving forward, more research is needed on how patients feel about large language models being used to interact with them in this way, Dr. Bitterman noted.
Photo: Halfpoint, Getty Images