Ever since OpenAI announced that people could join a waitlist to upload medical data to a beta version of ChatGPT Health and query the chatbot, scores of people have done just that.
They include Washington Post technology columnist Geoffrey Fowler and the daughter of Amy Gleason, the acting administrator of the U.S. DOGE Service and a strategic advisor at the Centers for Medicare & Medicaid Services; Gleason's daughter battles a rare disease. Their experiences with ChatGPT Health, shared this week online and at an in-person event, are polar opposites in terms of the accuracy of the bots' pronouncements.
On Monday, Fowler penned a long narrative about how he had joined a waitlist to use ChatGPT Health and then uploaded a decade's worth of step and cardiac measurements (29 million steps and 6 million heartbeats) gathered by his Apple Watch and stored in the Apple Health app. Then Fowler put a simple request to the health bot: “Give me a simple score (A-F) of my cardiovascular health over the last decade, including component scores and an overall evaluation of my longevity.”
He got an F. ChatGPT Health declined to say how long he would live. And each time he uploaded the same information, he got a different grade.
The story is a fascinating read, and everyone should read it. Fowler reports consulting his own doctor as well as Dr. Eric Topol, a well-known cardiologist and a champion of doctors adopting new, innovative technology. Both said ChatGPT Health was grossly wrong and that Fowler was quite healthy. The message of the story is clear: these products are being launched before they are ready and have the potential to do real harm to patients.
Further into the story, Fowler reports that the bot said the grade was based solely on the Apple Watch data and that it could provide a more useful score if he uploaded his medical records too. So he did, and the score went from an F to a D.
Apparently, some of the analysis was based on “an Apple Watch measurement known as VO2 max, the maximum amount of oxygen your body can consume during exercise,” and the way Apple estimates VO2 max appears to be inadequate. ChatGPT Health also leaned on other fuzzy measures. In other words, it focused on the wrong things and therefore handed out the F and D grades. Anthropic's Claude was not much better, the story reported.
Later, Fowler's personal doctor wanted to further evaluate his cardiac health and ordered a blood test that included a measurement of lipoprotein (a). This test measures a specific type of fat-carrying particle in the blood to assess cardiovascular risk beyond what standard cholesterol panels capture, and it may unearth hidden risks for heart attack, stroke, and atherosclerosis. Fowler noted that neither ChatGPT Health nor Claude had suggested he get that test, a reasonable point given that the bots had handed him such low grades. However, one could ask, “Was this test necessary?” After all, as Fowler himself noted, his doctor had reacted to the F grade by saying that he is “at such low risk for a heart attack that my insurance probably wouldn’t even pay for an extra cardio fitness test to prove the artificial intelligence wrong.”
Could the doctor be ordering the test out of caution and to put his mind at rest?
Separately, Fowler noted troubling signs in his interactions with ChatGPT Health. Currently, we worry about hallucinations in AI, that is, software seeing things that aren't there. Fowler reports something closer to senility: ChatGPT Health forgot his age, gender, and even his recent vital signs.
All in all, Fowler and his sources appear to conclude that the tools were not developed to “extract accurate and useful personal analysis from the complex data stored in Apple Watches and medical charts.” In a word, they are disappointing, and consumers should be aware.
For the polar opposite experience with ChatGPT Health, we turn to Gleason of DOGE and CMS. Gleason comes from a nursing background, and her daughter has battled a rare disease for years. Gleason was in San Francisco on Tuesday to talk about CMS' Health Technology Ecosystem at an event organized by the health data intelligence company Innovaccer.
She shared the heartbreaking story of her cheerleader-gymnast daughter, who went from doing flips and tumbles to breaking bones just from walking, and ultimately to being unable to stand up or climb stairs. A year and three months into that ordeal, a skin biopsy revealed her true malady: juvenile dermatomyositis, a rare, chronic autoimmune disease in children in which the immune system attacks blood vessels throughout the body, causing muscle inflammation and skin rashes. Gleason's daughter was around 11 at the time.
“She’s been on 21 meds a day, two infusions a month for 15 years, so she was so excited about this CAR-T trial because it could take away all of her meds,” Gleason told the audience.
But disappointment awaited her daughter, Morgan, now 27.
“So she went into the trial, [but] they declined her because she has ulcerative colitis overlap,” Gleason said. “They said that there was too much risk of taking her off all of her meds. She could have a bad reaction with her UC.”
Morgan was so frustrated that she gathered up the voluminous medical records Gleason had collected over the years and uploaded them to ChatGPT Health. She asked the health bot to “find me another trial,” and ChatGPT found her the exact same CAR-T trial, but with a crucial nugget of information.
“ChatGPT said, actually I think you’re eligible for that trial because I don’t think you have ulcerative colitis. I think you have a slight deviation called microscopic lymphatic colitis, which is a much slower reacting form of colitis, and it’s not an exclusion for the trial,” Gleason said.
ChatGPT didn’t stop there, apparently.
“And it also found in her records that when she had her tonsils out — when we were going through our year and three month journey — that she had had in her biopsy from her tonsils that said ‘evaluate for autoimmune disease,’ which no one had ever seen and was completely missed through her process,” Gleason said.
Clearly impressed by this interaction with ChatGPT Health, she added that “providers that adapt to this world are going to be the ones that do well and survive, and ones that resist it and try to push back on patients using it are the ones that are going to miss out on this phenomenon.”
Seated to her right at the panel discussion was Dr. Robert Wachter, physician, author, and professor and chair of the Department of Medicine at the University of California, San Francisco (UCSF). Dr. Wachter offered a bit of a warning for consumers using AI, recounting Fowler's above-mentioned journey.
“So the tools are useful, are beneficial in many ways, but I think the ultimate patient-facing tool is going to be more patient-specific than a generic ChatGPT or generic Open Evidence,” he said.
Gleason, perhaps, had the last word on this.
“I also think that today is the dumbest these models will ever be,” she said. “So they will continue to get better over time, and I think they should definitely be used in conjunction with a provider today.”
Photo: Olena Malik, Getty Images