
Sensitivity vs. specificity: The eternal AI debate

Which is more important when working with healthcare AI detection: that it never misses something on a patient scan, or that it never identifies something that isn’t there? In other words, which should you be more concerned with: AI sensitivity or specificity?

As one comes at the expense of the other, the real question is what the right balance between the two should be. But if you asked 100 people, I believe most would say sensitivity holds the utmost importance. Indeed, missing something critical on a scan could lead to disaster. Depending on how a particular AI solution fits into the healthcare pipeline, one miss could cost a patient their life. On the other hand, if an AI system flags a false abnormality, it merely incurs the expense of unnecessary tests. Would that be so bad?

Radiologists know that there's no simple answer regarding the importance of high sensitivity or specificity. The specific use case, the role of the solution in the patient's healthcare journey, and the prevalence of the disease being detected all affect what counts as 'good' specificity. Ultimately, AI accuracy must be tailored to the specific use case and pathology in order to truly provide value 'in the wild.'

Specificity is compounded

It may be helpful to define some terms here. A test's sensitivity is the share of positive cases it detects out of the total pool of positive cases. If you have 100 patients with pulmonary embolisms and an AI system spots 95, the system has 95% sensitivity. Specificity is how often a test comes back negative when the case actually is negative. If a hospital's AI examines 1,000 healthy patient scans for pulmonary embolism and wrongly detects 100 embolisms, the solution has 90% specificity: 900 of the 1,000 healthy cases were correctly cleared, or (1,000 - 100) / 1,000 = 90%.
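These two definitions can be written out directly from the counts of true and false results. A minimal sketch in Python, using the article's own pulmonary-embolism numbers (the function names are illustrative):

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives the test detects: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of actual negatives the test clears: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)

# 100 patients with pulmonary embolisms, 95 detected
print(sensitivity(true_pos=95, false_neg=5))     # 0.95

# 1,000 healthy scans, 100 wrongly flagged as embolisms
print(specificity(true_neg=900, false_pos=100))  # 0.9
```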

How do we define a "good" specificity? That depends on the prevalence of the condition you're trying to detect. Imagine an AI algorithm designed to identify male scans. In a group of 100 male scans and 100 female scans, an algorithm with 100% sensitivity and 95% specificity would wrongly classify 5 of the female scans as male. Not ideal, but not terrible either: out of 105 scans flagged as male, 100 would be correct. With a little post-processing, near-perfect results can be achieved.

At the other extreme, consider an algorithm designed to find a rare brain tumor that appears in only one of every 100,000 patients. Now, 100% sensitivity and 95% specificity mean something very different. In a group of 100,000 patients, the algorithm would flag roughly 5,000 scans, and almost ALL of those detections (roughly 4,999 of 5,000) would be false positives. As a result, the physician verifying the AI results may conclude that the system is inaccurate and unreliable.
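The gap between these two scenarios is captured by the positive predictive value (PPV): the probability that a flagged scan is truly positive. A sketch of how PPV collapses as prevalence drops, holding sensitivity at 100% and specificity at 95% as in the examples above:

```python
def ppv(prevalence, sensitivity=1.0, specificity=0.95):
    """Positive predictive value: P(truly positive | flagged positive)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Balanced case: half the scans are male (prevalence 0.5)
print(ppv(0.5))   # ~0.952 -> 100 of 105 flags are correct

# Rare tumor: prevalence 1 in 100,000
print(ppv(1e-5))  # ~0.0002 -> almost every flag is a false positive
```

Same sensitivity, same specificity; only the prevalence changed, and the share of trustworthy alerts fell from about 95% to about 0.02%.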

There are false positives and there are false positives…

However, not every false positive is bad. Consider, for example, an AI solution designed for detecting brain bleeds. It may raise an alarm when encountering an irregular structure such as a brain tumor.

Technically, that’s a false positive, but the AI did find a critical abnormality in the scan and flagged it to the clinicians. So, the impact remains positive.

On the other hand, if the AI mistook the skull itself for a brain bleed, it would flag excess noise — and it would be a bit embarrassing, to be honest.

The cost of an alert

Every false positive has a cost. The most immediate cost is time, as a physician looks for a tumor or embolism that isn’t on the scan. If the false positive indicates an urgent condition, such as a stroke, it can interrupt the physician’s workflow, resulting in a forced context switch.

But false positives can be worse. If an on-call physician at home is woken by an alert, rushes to the hospital, and assesses the patient only to discover that the alert was false, the cost is massive.

The problem is further exacerbated when several AI systems run in parallel on a single patient study. With the old computer-aided diagnosis (CAD) tools, the FP/S rate (false positives per scan) was higher than 1, meaning that, on average, every exam had at least one false finding.

Using such systems to analyze modern CT studies, which can comprise thousands of images, would be completely untenable. With modern, deep-learning-based AI, the false-positive rate can be reduced to a small fraction of studies. Even so, once tens of algorithms are running in parallel, the average study would still carry at least one false-positive detection for some condition. At that point, the AI could be giving physicians more work than they can handle.

The future of AI

As imaging-AI algorithms mature into integrated workflow solutions, both AI vendors and hospitals are exploring methods to support accurate and efficient clinical workflows.

AI detection of incidental findings holds great potential to improve patient care, allowing, for example, incidental pulmonary embolisms to be caught in low-priority outpatient oncology scans. Today, many of these cases are overlooked. However, incidental findings tend to have low prevalence (e.g., 1-3% for pulmonary embolism in oncology scans), which risks a high rate of false positives. Accordingly, incidental-detection algorithms must feature both high sensitivity AND high specificity. Indeed, incidental detection has only become clinically viable in the past year or so, following recent accuracy advances.

The future of clinically successful healthcare AI relies on robust accuracy. Increasing specificity from 90% to 95% might not sound like much, but it cuts false positives (and false alerts) in half. Fortunately, with larger CNNs, more data, and better models, AI is becoming more accurate every day.
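The two-fold claim follows directly from the definition: the number of false positives scales with (1 - specificity). A quick check on a hypothetical batch of 1,000 healthy scans:

```python
healthy_scans = 1000

fp_at_90 = healthy_scans * (1 - 0.90)  # ~100 false positives at 90% specificity
fp_at_95 = healthy_scans * (1 - 0.95)  # ~50 false positives at 95% specificity

print(round(fp_at_90 / fp_at_95))  # 2 -> false alerts cut in half
```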

But raw accuracy is only half of the story. The other half is workflow optimization. Physicians need effective tools at their disposal for managing AI alerts, so that they aren't overwhelmed by low-priority false positives while critical true-positive alerts are still attended to.

One way to resolve the dilemma posed by AI sensitivity and specificity is an integrated platform in which many AI algorithms run in parallel, coupled with a user-driven AI control system that weighs the relative importance of each alert from all the AI subsystems and displays them in the optimal manner for each user at each point in time. With this type of capability, healthcare AI will move fundamentally closer to providing truly comprehensive decision support.

Photo: mrspopman, Getty Images

This post appears through the MedCity Influencers program. Anyone can publish their perspective on business and innovation in healthcare on MedCity News through MedCity Influencers.