
Lies, damned lies and AI statistics

At a time when artificial intelligence is gaining ground in healthcare, especially in radiology, it is important to understand the varied metrics being used to evaluate the technology.


Understanding what determines accuracy is confusing, but necessary, when it comes to evaluating artificial intelligence in radiology.

The term accuracy is becoming increasingly ambiguous. Some vendors report different metrics for accuracy, so it’s hard to compare solutions. Some play around with the data selection, which skews performance. And some simply use confusing metrics. As a result, AI performance can be misinterpreted, leading to erroneous conclusions.

AUC – the most damned lie of them all
One of the most common metrics used by AI companies is Area Under Curve (AUC). This way of measuring accuracy was actually invented by radar operators in World War Two, but it’s become very common in the machine learning community. 

Without getting too technical, AUC measures how much more likely the AI solution is to correctly classify a positive result (say, to correctly detect a pulmonary embolism in a scan) versus how likely the same AI would be to wrongly detect something when it isn’t there.
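For the technically curious, here’s a minimal sketch of how that number gets produced in practice, using Python and scikit-learn with made-up labels and scores (nothing here comes from a real product or dataset):

```python
# A toy illustration of computing AUC, using scikit-learn and invented data.
from sklearn.metrics import roc_auc_score

# 1 = finding present (e.g., pulmonary embolism), 0 = absent -- hypothetical ground truth
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
# The AI's confidence that each scan is positive -- hypothetical outputs
y_score = [0.92, 0.85, 0.45, 0.60, 0.30, 0.22, 0.15, 0.05]

# AUC is the probability that a randomly chosen positive scan receives a
# higher score than a randomly chosen negative scan.
print(roc_auc_score(y_true, y_score))  # ~0.93 here: one negative outranks one positive
```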

Let me start with my personal belief: area under the curve (AUC) is a bad metric. While it is scientifically important, it’s confusing to physicians, because most clinical users find AUC difficult to understand and to ‘weight’ appropriately. It’s grossly overemphasized. Consider the following example:

I recently read an AI research paper showing an ‘impressive’ AUC of 0.95. Since 0.95 is close to 1 (which would be perfect performance), it seems like it must be an excellent solution.


However, a solution with 89 percent sensitivity and 84 percent specificity could get that AUC of 0.95; so could an AI with 80 percent sensitivity and 92 percent specificity. Don’t get me wrong: depending on the application, these could be good performance scores. However, many users would still feel cheated; they’d expect much better results overall from a solution boasting “AUC = 0.95”.

AUC provides a single aggregated measure of how the system performs. However, real users want to know about system performance in specific environments (or working points). 
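To make the idea of a working point concrete, here is a rough sketch (Python and scikit-learn again, with simulated scores rather than clinical data) of how you’d read a specific operating point off the same ROC curve that produced the single AUC number:

```python
# Hypothetical example: one score distribution can be summarized as a single
# AUC, or inspected at a specific working point (e.g., specificity >= 90%).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
# Simulated scores: 100 positive and 900 negative cases, positives scoring higher
y_true = np.concatenate([np.ones(100), np.zeros(900)])
y_score = np.concatenate([rng.normal(2.0, 1.0, 100),   # positive cases
                          rng.normal(0.0, 1.0, 900)])  # negative cases

fpr, tpr, _ = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))

# The working point: the best sensitivity achievable while keeping
# specificity (1 - false positive rate) at or above 90 percent.
at_90_spec = tpr[(1 - fpr) >= 0.90].max()
print("Sensitivity at >=90% specificity:", at_90_spec)
```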

For these reasons, I find AUC to be confusing for clinical users. 

Sensitivity/Specificity – The Old Faithful
My favorite statistics are sensitivity and specificity: straightforward, simple and highly important in terms of usability. 

Sensitivity measures how many positive cases an algorithm detects out of all positive cases. Let’s say you have 100 real brain bleed patients in your department in a week. If the AI detects 95 of them, it has 95 percent sensitivity.

Likewise, specificity is the number of cases accurately classified as negative out of all negative cases. This means that if you have 1,000 negative cases (in a real-world scenario, you usually have more negative cases than positive) and the AI wrongly flags 80 of them as positive, the 920 accurate negative classifications mean the AI has 92 percent specificity. Together, these measures give you a good idea of how many patients could be missed by the AI.
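The arithmetic behind those two numbers is simple enough to write out; here’s a short sketch using the hypothetical brain-bleed week described above:

```python
# Sensitivity and specificity for the hypothetical brain-bleed example above.
true_positives = 95    # real bleeds the AI flagged (out of 100 positives)
false_negatives = 5    # real bleeds the AI missed
true_negatives = 920   # negative scans correctly left alone (out of 1,000 negatives)
false_positives = 80   # negative scans wrongly flagged

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)

print(f"Sensitivity: {sensitivity:.0%}")  # 95%
print(f"Specificity: {specificity:.0%}")  # 92%
```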

PPV and NPV – The King and Queen of Statistics
Positive Predictive Value (PPV) and Negative Predictive Value (NPV) are actually the most important statistics for assessing AI. While sensitivity and specificity are interesting from the perspective of a technical evaluation, clinically, PPV and NPV better reflect the user experience.

PPV is the number of true positive cases, out of the total cases flagged as positive. This reflects how many accurate alerts appear in the day-to-day. A PPV of 80 percent means that 8 out of every 10 alerts a user would see would be correct. 

In other words, PPV is the “spam” metric reflecting how many irrelevant alerts a user would see in the day-to-day. Thus, the lower the PPV, the more “spam” (irrelevant alerts).

Let’s walk through a real-world scenario. Imagine you’re using an AI with 95 percent sensitivity and 90 percent specificity to detect c-spine fractures. Over the course of one month, your department scans 1,000 cervical spine cases and, out of those, 100 are deemed to be positive for fractures.

The number of true positive (TP) cases, where the AI correctly spots a fracture, would be 95 percent of 100 (95). The number of false positives (FP), where the AI thinks it’s found a fracture in a healthy patient, would be 10 percent of 900 (90). The 95 TPs and the 90 FPs make 185 positive alerts altogether. That makes PPV 95/185, so 51 percent. 

Once again we need to calibrate our intuition. Our system features both high sensitivity (95 percent) and high specificity (90 percent). However, PPV is “only” 51 percent. Why? 

The culprit is the data mix. Although the rate of false positives is relatively low, there is a very high number of negative cases in the first place (900 negatives vs. 100 positives), so every percentage point of specificity makes a huge difference to the user experience.
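If you want to replay that calculation with your own numbers, here is a small sketch of PPV as a function of sensitivity, specificity and the local case mix (the c-spine figures below are the illustrative ones from the scenario above, not real-world data):

```python
def ppv(sensitivity, specificity, positives, negatives):
    """Positive predictive value for a given case mix (illustrative sketch)."""
    true_positives = sensitivity * positives
    false_positives = (1 - specificity) * negatives
    return true_positives / (true_positives + false_positives)

# The hypothetical c-spine month: 100 fractures, 900 normal scans
print(f"PPV: {ppv(0.95, 0.90, 100, 900):.0%}")  # ~51%
```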

Conversely, Negative Predictive Value (NPV) reflects the “peace of mind” of the user: how sure you can be that if the AI says a case is negative, it’s actually negative. In other words, out of all cases the AI calls negative, how many are really negative?

This time around, the fact that most patients don’t have the relevant condition actually drives the NPV up, so values of 97%+ are very common. If a ‘bad’ AI solution with sensitivity and specificity of only 80% worked on our c-spine patients, you’d still see an NPV of roughly 97.3%. For a good system featuring 95% sensitivity and specificity, with the same data mix, NPV would be around 99.4%.
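The same back-of-the-envelope math shows why NPV flatters even a mediocre system; here’s a quick sketch using the same hypothetical case mix:

```python
def npv(sensitivity, specificity, positives, negatives):
    """Negative predictive value for a given case mix (illustrative sketch)."""
    true_negatives = specificity * negatives
    false_negatives = (1 - sensitivity) * positives
    return true_negatives / (true_negatives + false_negatives)

# Same hypothetical case mix: 100 fractures, 900 normal scans
print(f"NPV at 80%/80%: {npv(0.80, 0.80, 100, 900):.1%}")  # ~97.3%
print(f"NPV at 95%/95%: {npv(0.95, 0.95, 100, 900):.1%}")  # ~99.4%
```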

So, while NPV and PPV are extremely helpful and important to understand, you have to adjust your expectations of what counts as a good or bad result.

In a typical setting, disease prevalence is relatively low, say 2%-15%. Depending on the exact numbers, an AI could still be great if it had a PPV in the 50%-70% range. For rarer diseases, a PPV number as low as 20% could still represent excellent performance! 
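To see how strongly prevalence drives this, here’s one more sketch that fixes a solid 95 percent sensitivity / 90 percent specificity system and varies only the prevalence (reusing the illustrative ppv() helper from earlier):

```python
# How PPV shifts with prevalence for a fixed 95% sensitivity / 90% specificity
# system, reusing the illustrative ppv() helper sketched above.
for prevalence in (0.02, 0.05, 0.10, 0.15):
    positives = prevalence * 1000
    negatives = (1 - prevalence) * 1000
    print(f"Prevalence {prevalence:.0%}: PPV = "
          f"{ppv(0.95, 0.90, positives, negatives):.0%}")
# Roughly 16%, 33%, 51% and 63% -- the algorithm hasn't changed, only the case mix.
```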

NPV, however, should be really high. You should look for an NPV of 95% or higher in reliable AI systems.

Statistics in practice
A recap on AI imaging statistics:

  • Area Under Curve (AUC) doesn’t give you enough information about an AI’s performance
  • Positive predictive value (PPV) is the spam metric. Very powerful algorithms may still score seemingly low PPV values
  • Negative predictive value (NPV) measures your peace of mind. Weaker algorithms still achieve very high values, so be careful. 
  • Sensitivity tells you how many of the positive cases you’ll find; specificity tells you how many of the negative cases will be correctly cleared
  • Sensitivity and specificity should be the same in every hospital, while PPV and NPV are affected by the prevalence of a condition, so will change between hospitals and use cases.

In summary, statistics are critical to understanding the AI world, but, as the saying attributed to Benjamin Disraeli goes: “There are three kinds of lies: lies, damned lies, and statistics.” The better you understand them, the less likely you are to be misled.

Photo: chombosan, Getty Images
