FDA-cleared AI devices lack critical information on performance, equity

The Food and Drug Administration is greenlighting AI tools at a faster pace, from computer vision tools that flag potential cases of pulmonary embolism on CT scans to tools that identify early-stage lung cancer. But critical information that could show how these devices actually work in patient care is missing from the FDA clearance process.

A review of 130 FDA-cleared AI devices published in Nature found that almost all of them were based on retrospective data. Most disclosures also didn’t share how many sites were used to evaluate an AI device, or whether the devices had been tested to see how they perform in patients of different races, genders, or locations.

“We really don’t know, for most of these devices that are approved, what are these potential biases — positive or negative — on clinicians because they’re not evaluated in prospective studies,” said James Zou, an assistant professor of biomedical data science at Stanford and a faculty affiliate with Stanford HAI, who was a senior researcher of the paper.

Most cleared AI devices are relatively new, and the FDA is still ironing out regulations for how they should be evaluated. Of the 130 tools evaluated by researchers, 75% had been cleared in the past two years, and more than half in the last year alone.

“How to evaluate an AI-based algorithm is very much a work-in-progress from the regulator’s perspective,” Zou said. “That’s where our research comes in — to see what information is currently being provided by these device companies and whether they’re sufficient to really evaluate the reliability of these medical AI devices. That’s where I think some of the key limitations or shortcomings of this information come in.”

Most studies were retrospective
One of the biggest limitations of the cleared devices was that the vast majority only used retrospective data, meaning most of the data had already been collected before a model was evaluated. Out of the 130 devices, 126 of them were cleared based on retrospective data.

“That’s one of the key shortcomings we were really surprised to find,” Zou said.

This is important because most of the cleared AI tools were designed to work as triage devices or support clinicians’ decisions. A prospective study would be needed to see how they actually work with physicians’ and hospitals’ data systems. For example, do clinicians ignore an alert, or rely on it too much?

“A prospective, randomized study may reveal that clinicians are misusing this tool for primary diagnosis and that outcomes are different from what would be expected if the tool were used for decision support,” the authors of the paper noted.

Different hospitals also have different ways of processing their data, which could affect a model’s performance in the real world.

Most companies didn’t report the number of test sites
Another key shortcoming is that most disclosures didn’t share how many sites were used to test the AI devices. Of the 130 devices, only 41 had disclosures that shared how many sites were used to evaluate the algorithm, and the numbers weren’t exactly encouraging. Four devices had only been evaluated in one hospital, and eight devices had only been evaluated in two hospitals.

The number of testing sites makes a big difference in how well AI tools perform, perhaps even more so than for other types of medical devices, since machine learning tools are shaped by the data used to train them. A tool evaluated at only one or two hospitals may not hold up elsewhere.

While the number of test sites might be available to the FDA, the lack of publicly disclosed information doesn’t help clinicians, researchers or hospitals evaluating a potential tool.

It also wasn’t always clear how many patients were involved in a test of an algorithm. Of the 71 devices where this information was shared, companies had evaluated the AI tools in a median of 300 patients.

“Having more thorough public disclosures of the algorithms would be beneficial to the community,” Zou said. “One thing we’re finding is right now the public disclosures provided by the FDA are quite sparse or limited.”

Differences between sites can lead to disparities in how models work with patients
Oftentimes, these disclosures only report one number for an algorithm’s overall performance. This can mask a lot of potential vulnerabilities in how a model performs across different patient populations, Zou said.
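
To illustrate how a single headline figure can hide a gap, here is a minimal Python sketch that reports accuracy overall and per group. It is not taken from the paper: the function name `accuracy_by_group`, the group labels and the toy records are all hypothetical, invented only for this example.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}).

    Each record is a (group, true_label, predicted_label) tuple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, label, prediction in records:
        total[group] += 1
        correct[group] += int(label == prediction)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group


# Hypothetical toy data: a large group where the model does well
# and a small group where it does much worse.
records = (
    [("group_A", 1, 1)] * 90 + [("group_A", 1, 0)] * 10
    + [("group_B", 1, 1)] * 12 + [("group_B", 1, 0)] * 8
)

overall, per_group = accuracy_by_group(records)
print(f"overall: {overall:.2f}")
for group, acc in per_group.items():
    print(f"{group}: {acc:.2f}")
```

In this made-up example the overall figure is dominated by the larger group, so reporting it alone would say little about how the model does in the smaller one.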

To test this out, he and his colleagues created three of their own AI models to triage potential cases of collapsed lung (pneumothorax) from X-rays, since the FDA has already cleared four models of this kind.

They pulled in images from three publicly available chest X-ray datasets from hospitals in different locations across the U.S. They used data from each site to train three different deep learning models and evaluated them on the data from the other two sites.

When the models were tested at a different site, they found substantial drop-offs in how the models performed, across the board.
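
The train-at-one-site, test-at-the-others design can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the researchers trained deep learning models on chest X-rays, whereas the sketch below swaps in a toy one-feature threshold classifier, and the site names, scores and helper functions (`fit_threshold`, `accuracy`, `cross_site_matrix`) are all invented.

```python
from statistics import mean


def fit_threshold(samples):
    """Toy 'model': midpoint between the mean score of positives and negatives."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    return (mean(pos) + mean(neg)) / 2


def accuracy(threshold, samples):
    """Fraction of samples where 'score above threshold' matches the label."""
    return mean(int((x > threshold) == bool(y)) for x, y in samples)


def cross_site_matrix(sites):
    """results[train_site][test_site] = accuracy of the model fit on train_site."""
    results = {}
    for train_name, train_data in sites.items():
        threshold = fit_threshold(train_data)
        results[train_name] = {
            test_name: accuracy(threshold, test_data)
            for test_name, test_data in sites.items()
        }
    return results


# Hypothetical (score, label) pairs; each site's scores are shifted,
# standing in for different scanners or data processing.
sites = {
    "site_1": [(10, 0), (20, 0), (30, 0), (70, 1), (80, 1), (90, 1)],
    "site_2": [(40, 0), (55, 0), (60, 0), (100, 1), (110, 1), (120, 1)],
    "site_3": [(0, 0), (10, 0), (20, 0), (35, 1), (45, 1), (60, 1)],
}

results = cross_site_matrix(sites)
for train_name, row in results.items():
    print(train_name, {name: round(acc, 2) for name, acc in row.items()})
```

Each row of the output is one model. The entry where the training and test site match is the in-site score, and the other two entries show how that model fares on data it was not fit to.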

These differences don’t just reflect different hospital practices, but can also reflect different patient demographics. For example, they found a significant disparity between how the models performed in white patients and Black patients when they were tested at a different site.

FDA clearance isn’t required for many clinical-decision support tools that are currently in use.

The biggest recommendation, Zou said, would be for the companies making these algorithms to do more prospective testing.

“It wouldn’t fully address all of the challenges, but it would be a big step,” he said.

He also recommended that manufacturers test their algorithms in more sites, and more diverse sites, especially for higher risk devices.

Researchers at Stanford are also building a framework for testing AI devices after they are deployed to see how well they work. For example, if a hospital’s patient population changes substantially (say, because of a pandemic), or if the x-ray machines themselves are changed, it could affect a model’s performance.

“I don’t want to put too much of a damper on this. Our work is also developing a lot of these algorithms,” Zou said. “Part of the reason we do want to be more rigorous in evaluating them is because of their potentially transformative power.”

Photo credit: Hemera Technologies, Getty Images

Elise Reuter reported this story while participating in the USC Annenberg Center for Health Journalism’s 2020 Data Fellowship.