
Advice Given by ChatGPT Vs. Human Providers Is Nearly Indistinguishable, NYU Study Says

NYU researchers conducted a study this year in which nearly 400 people were asked to identify whether responses to patient questions were written by human providers or by ChatGPT. Participants had only a limited ability to tell the two apart, leading the study's authors to conclude that LLMs like ChatGPT could be an effective way to streamline healthcare providers' communication with patients.

ChatGPT’s responses to healthcare-related questions are pretty difficult to tell apart from responses given by humans, according to a new study published in JMIR Medical Education.

The study, which was conducted by NYU researchers in January, was meant to assess the feasibility of using ChatGPT or similar large language models to answer the long list of questions that providers face in the electronic health record. It concluded that the use of LLMs like ChatGPT could be an effective way to streamline healthcare providers’ communication with patients.

To perform the study, the research team extracted patient questions from NYU Langone Health's EHR. They then entered these questions into ChatGPT and asked the chatbot to respond using roughly the same number of words as the provider's original answer in the EHR.

Next, the researchers presented nearly 400 adults with ten sets of patient questions and responses. They informed the participants that five of these sets contained answers written by a human healthcare provider, and the other five contained responses written by ChatGPT. Participants were asked, and financially incentivized, to correctly identify whether each response was generated by a human or by ChatGPT.

The research team found that people have a limited ability to accurately distinguish between chatbot and human-generated answers. On average, participants correctly identified the source of the response about 65% of the time. These results were consistent regardless of study participants’ demographic characteristics.

The study’s authors said that this research demonstrates the potential that LLMs have to aid in patient-provider communication, specifically for administrative tasks and managing common chronic diseases. 


However, they noted that additional research is needed to explore the extent to which chatbots can assume clinical responsibilities. The research team also emphasized that it is crucial for provider organizations to exercise caution when curating LLM-generated advice to account for the limitations and potential biases of these AI models.

When conducting the study, the researchers also asked participants about their trust in chatbots to answer different types of questions, using a 5-point scale ranging from "completely untrustworthy" to "completely trustworthy." They found that people's trust in chatbots was highest for logistical questions — such as those about insurance or scheduling appointments — as well as questions about preventive care. Participants' trust in chatbot-generated responses was lowest for questions about diagnoses or treatment advice.

This NYU research is not the only study published this year that supports the use of LLMs to answer patient questions. 

In April, a study published in JAMA Internal Medicine suggested that LLMs have significant potential to alleviate the massive burden physicians face in their inboxes. The study evaluated two sets of answers to patient inquiries — one written by physicians, the other by ChatGPT. A panel of healthcare professionals determined that ChatGPT outperformed human providers because the AI model’s responses were more detailed and empathetic.

Photo: Vladyslav Bobuskyi, Getty Images