MedCity Influencers

The Hidden Dangers of AI Confidence Scores in Healthcare

AI can serve as an invaluable tool, but blindly trusting “confidence scores” creates serious risks. Here are some better alternatives.

AI innovation and adoption are booming in healthcare, from diagnostic tools to personalized medicine. While healthcare leaders are optimistic, IT leaders I’ve spoken with are less certain. When lives are on the line, how can you know whether an AI tool produces trustworthy results?

Recently, some groups have recommended confidence scores as a way to measure AI’s reliability in healthcare. In the context of AI, confidence scores often stem from approximations rather than validated probabilities. Particularly in healthcare, large language models (LLMs) might produce confidence scores that don’t correspond to true likelihoods, which can create a misleading sense of certainty. 

In my opinion, as a healthcare tech leader and AI enthusiast, this is the wrong approach. AI can serve as an invaluable tool, but blindly trusting “confidence scores” creates serious risks. Below, I’ll outline what these risks are and suggest what I believe are better alternatives so you can use AI without compromising your organization’s work.

Confidence scores explained in the context of AI

Confidence scores are numbers meant to show an AI tool’s certainty about an output, like a diagnosis or a medical code. To understand why healthcare users should not trust confidence scores, it’s important to explain how the technology works. In AI, a confidence score usually comes from the model’s own internal probability estimates, which are shaped by its training data rather than derived from a validated statistical measure of how likely the output is to be correct.
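
To make that distinction concrete, here is a minimal Python sketch (with made-up numbers, not tied to any specific product) of where many displayed confidence scores originate: a softmax over the model’s raw outputs. The percentages it produces look precise, but they reflect the model’s training rather than a validated measure of real-world accuracy.

```python
import math

def softmax(logits):
    """Convert a model's raw outputs (logits) into values that sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs for three candidate diagnosis codes.
logits = [2.1, 0.4, -1.3]
scores = softmax(logits)
print([f"{s:.0%}" for s in scores])  # ['82%', '15%', '3%']

# The 82% above is what often gets displayed as a "confidence score."
# It is an artifact of the model's training, not a validated probability
# that the first code is correct for this particular patient.
```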

These show up often in other forms of technology. Think of a dating app that gives users a match score, for example. Seeing these scores in everyday life can easily mislead someone into thinking they’re dependable and appropriate for other contexts, like healthcare.

For clinicians who pull up generative AI summaries in a patient’s chart, for example, a displayed confidence score can imply false certainty, leading to unintended errors if they trust this technology over their own judgment.

I believe that including these scores on a healthcare platform poses too great a risk. I’ve chosen not to display confidence levels in the AI solutions I design because I believe they can discourage users from thinking critically about the information on their screens. This is especially true for users who aren’t trained in analytics or aren’t familiar with the mechanics of AI or ML. 

A flawed approach for grading AI output

AI confidence scores often appear as percentages, suggesting a certain likelihood that a code or diagnosis is correct. However, for healthcare professionals not trained in data science, these numbers can seem deceptively reliable. Specifically, these scores pose four significant risks:

  1. Misunderstanding of context – Out of the box, AI models are trained on population-level data, not on a provider’s specific patient population. This means an off-the-shelf AI tool doesn’t account for the clinician’s local health patterns, and its confidence score reflects broad assumptions rather than tailored insight. This leaves clinicians with an incomplete picture.
  2. Overreliance on displayed scores – When a user reads a 95% confidence score, they may assume there’s no need to investigate further. This can oversimplify data complexities. At worst, it encourages clinicians to bypass their own critical review or miss nuanced diagnoses. Automation bias, a phenomenon where users over-trust technology outputs, is particularly concerning in healthcare. Studies indicate that automation bias can lead clinicians to overlook critical symptoms if they assume an AI’s confidence score is conclusive.
  3. Misrepresentation of accuracy – The intricacies of healthcare don’t always match statistical probabilities. A high confidence score might align with population-level data, but AI can’t diagnose any particular patient with certainty. This mismatch can create a false sense of security (a simple audit of this gap is sketched after this list).
  4. False security generates errors – If clinicians lean too heavily on an AI recommendation’s high score, they might miss other potential diagnoses. For example, if the AI suggests high confidence in a particular code, a clinician might skip further investigation. If that code is incorrect, it can cascade through subsequent care decisions, delaying critical interventions or creating a billing mistake in a value-based care contract. These mistakes compromise trust, whether it’s a platform user who becomes wary of AI or an insurance biller who questions incoming claims.
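
One way to see the gap described above is to audit displayed confidence against locally reviewed outcomes. The sketch below is a hypothetical Python example (the function name and sample data are invented for illustration) of that kind of check: it asks how often outputs the model scored at 90% or higher were actually confirmed by reviewers for a given organization’s patients.

```python
def local_reliability(reviewed_cases, threshold=0.90):
    """Compare displayed confidence against locally observed accuracy.

    `reviewed_cases` is a list of (confidence, was_correct) pairs drawn from
    a local review sample, e.g. coder- or clinician-validated suggestions.
    Returns the fraction of high-confidence outputs that were actually correct.
    """
    high = [ok for conf, ok in reviewed_cases if conf >= threshold]
    if not high:
        return None
    return sum(high) / len(high)

# Hypothetical review sample: the model claimed at least 90% confidence,
# but local reviewers confirmed only some of the suggested codes.
sample = [(0.95, True), (0.93, False), (0.97, True), (0.91, False), (0.94, True)]
print(f"Displayed: 90%+ confident; observed locally: {local_reliability(sample):.0%} correct")
```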

A better way of helping users understand the strength of AI output 

Localized data and knowledge of how end users will interact with AI tools enable you to tailor AI to work effectively. Instead of relying on confidence scores, I recommend these three methods for creating trustworthy outputs:

  1. Localize and update AI models often – Tailoring AI models to include local data (specific health patterns, demographics, and evolving health conditions) makes the AI output more relevant. There’s a higher percentage of patients with type 2 diabetes in Alabama, for example, than in Massachusetts, and an accurate output depends on timely, localized data that reflects the population you serve. Knowing what data is fed into a model, and how the model is developed and maintained, is a necessary part of understanding its output. Regularly retraining and auditing models with fresh, localized data keeps them aligned with current standards and discoveries, and reduces the risk of confidence scores that don’t reflect real-world dynamics.
  2. Thoughtfully display outputs for the end user – Consider how each user interacts with data and design outputs to meet their needs without assuming that “one size fits all” works for everyone. In other words, outputs need to match the user’s perspective. What is meaningful for a data scientist is different from what’s meaningful to a clinician. Instead of a single confidence score, consider showing contextual data, such as how often similar predictions have been accurate within specific populations or settings. Comparative displays can help users weigh AI’s recommendations more effectively.
  3. Support, but don’t replace, clinical judgment – The best AI tools guide users without making decisions for them. Use stacked rankings to present a range of diagnostic possibilities with the strongest matches on top. Ranking possibilities gives clinicians options to consider and keeps the final decision with their professional judgment rather than automatic acceptance; a minimal sketch of this kind of display follows this list.
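
As an illustration of points 2 and 3, here is a minimal Python sketch of a stacked ranking paired with local context instead of a single confidence percentage. The `Suggestion` fields, the `rank_for_clinician` helper, and the numbers are illustrative assumptions, not an existing product’s API.

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    code: str              # e.g. an ICD-10 code
    description: str
    model_score: float     # raw model output, not a validated probability
    local_hit_rate: float  # how often reviewers confirmed this suggestion locally

def rank_for_clinician(suggestions, top_n=3):
    """Build a stacked ranking instead of a single confidence percentage.

    Strongest matches appear first, each paired with local context so the
    clinician can weigh the options rather than accept one outright.
    """
    ranked = sorted(suggestions, key=lambda s: s.model_score, reverse=True)[:top_n]
    return "\n".join(
        f"{i}. {s.code} {s.description} "
        f"(confirmed in {s.local_hit_rate:.0%} of similar local cases)"
        for i, s in enumerate(ranked, start=1)
    )

# Hypothetical candidates for a chart-summary panel.
candidates = [
    Suggestion("E11.9", "Type 2 diabetes without complications", 0.91, 0.78),
    Suggestion("E11.65", "Type 2 diabetes with hyperglycemia", 0.74, 0.61),
    Suggestion("R73.03", "Prediabetes", 0.52, 0.70),
]
print(rank_for_clinician(candidates))
```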

Clinicians need tech tools designed to support their expertise and discourage blind reliance on confidence scores. By blending AI insights with real-world context, healthcare organizations can embrace AI responsibly, building smoother workflows and, most importantly, delivering safer patient care.

Photo: John-Kelly, Getty Images

Brendan Smith-Elion is VP, Product Management at Arcadia. He has more than 20 years of experience in the healthcare vendor space. His passion is product management, but he also has experience in business development and BI engineering roles. At Arcadia, Brendan is dedicated to driving transformational outcomes for clients through data-powered, value-focused workflows.

He started his career at Agfa, where he led the cardiology PACS platform, before moving on to Chartwise, a startup focused on clinical documentation improvement. Brendan also spent time at athenahealth, where he led efforts to develop provider workflows for meaningful use, quality measures, specialty workflows, and clinical microservices for ordering and a universal chart service. Before Arcadia, he worked at Alphabet/Google on a healthcare data platform for the Verily Health Platform teams, building data products for payer and provider preventative disease management.

This post appears through the MedCity Influencers program. Anyone can publish their perspective on business and innovation in healthcare on MedCity News through MedCity Influencers. Click here to find out how.