Artificial Intelligence, Startups, Health Tech, Providers, Pharmacy

The Saga of Utah’s Rx Refill Bot: A Bold Bet on AI & Researchers Who Cried Foul

Utah is piloting an AI system developed by Doctronic to autonomously handle routine prescription refills, aiming to reduce delays and boost medication adherence. But a report from Mindgard AI claiming that there are vulnerabilities in the company’s chatbot underscores the broader challenge regulators and developers face in ensuring healthcare AI systems remain safe and reliable in real-world use.

Artificial intelligence in medicine. Doctor character uses ai app for patient diagnosis, medical analysis. Healthcare technology concept. Vector illustration.

As AI models increasingly touch more aspects of clinical care, the medical community, the government and the public are still grappling with how to manage such a transformative shift. There are still a lot of unanswered questions about reliability, transparency, safety, security and ethics.

This conundrum is currently playing out in Utah. In January, the state became the first in the nation to allow an AI system to autonomously handle routine prescription refills for patients with chronic conditions. The pilot intends to reduce delays and friction in the prescription refill process, which can be a major barrier to medication adherence. Earlier this month, however, researchers said they found flaws in a chatbot made by New York-based startup Doctronic, the same company Utah is partnering with for its pilot. 

Doctronic operates a telehealth clinic in all 50 states, offering insurance-covered care from its in-house physicians, which are employed as W-2 workers. It also creates AI systems designed to help clinicians manage routine prescription refills, developed using guidelines written by its own doctors.

The report critiquing Doctronic’s AI was published by London-based Mindgard AI, a cybersecurity and research company born out of Lancaster University. It sells AI vulnerability tools and specializes in stress-testing AI systems for safety and security vulnerabilities.

In the report, Mindgard detailed how it tricked the system into producing dangerous medical guidance and altering prescription doses. However, both Doctronic and Utah’s Office AI Policy say that the vulnerabilities Mindgard uncovered do not reflect the AI system currently managing patient prescriptions in the state, noting that the AI bot involved in the pilot operates under strict safeguards.

Still, the investigation underscores the challenges regulators and AI developers face in ensuring these models behave reliably in real-world settings.

Utah is trying something new

Research shows that up to half of people with heart disease or diabetes don’t stick to their medication plan, which only leads to preventable complications and more costly care down the road. By automating this routine task, Utah hopes to relieve burnt-out clinicians while ensuring patients receive their medications in a timelier manner. 

The state said that the main goal is to boost adherence, as well as gather real-world data on the safety and efficacy of AI-assisted medication dispensing. 

Under the pilot, Doctronic’s system only manages refills for patients who are already under a clinician’s care, with oversight built into the process to ensure that prescribing decisions remain monitored by physicians and other healthcare professionals. 

Mindgard conducted its investigation in January, shortly after the pilot program launched.

In its report, Mindgard showed that Doctronic’s AI could be jailbroken by exploiting flaws in its system prompts — the hidden instructions that govern its behavior. By tricking the AI bot into reciting and then rewriting these instructions, the researchers were able to make it generate unsafe clinical guidance, including wildly incorrect medication doses and instructions for illegal drugs. 

For instance, when the researchers cited a fabricated regulatory body and fake press bulletin, the AI model said it would triple the standard prescribed dose of Oxycontin.

Source: Mindgard AI

Peter Garraghan, Mindgard’s founder and chief science officer, emphasized that the investigation was intended to highlight systemic safety and security risks in healthcare AI applications in general, not just with Doctronic’s algorithms specifically. 

He explained that researchers are typically able to extract a chatbot’s system prompts simply by talking with it. In other words, by using carefully crafted questions researchers can usually manipulate the AI model to disclose its underlying instructions.

After Mindgard’s researchers were able to extract parts of those instructions for Doctronic’s AI model, they learned details about the model’s safeguards and knowledge cutoff date. The bot told them its knowledge base is limited to data released before June 2024.

They then manipulated the system further, feeding it “new guidance” that a made‑up medical authority had released after its knowledge cutoff date.

Because large language models are designed to be helpful and cannot truly verify information, the system accepted the false instructions and began generating unsafe outputs, Garraghan said.

He stressed that the AI model’s vulnerability arose from fundamental flaws in large language models — which cannot inherently distinguish between safe data and control instructions, making them susceptible to social engineering and manipulation. 

“At a high level, I’m not particularly surprised, but that’s more of an indictment of the entire industry, as opposed to Doctronic itself. The difference is Doctronic’s domain is very important. It’s one thing to have an AI chat bot that has a database of music records, for example, which doesn’t have anything sensitive in it, versus people using it for medical advice and maybe prescriptions. That’s a much more serious concern,” he remarked.

Separating fear from reality

Doctronic’s co-CEOs — Matt Pavelle and Dr. Adam Oskowitz — said Mindgard didn’t uncover any new risks, noting that the kinds of prompt-manipulation vulnerabilities the report demonstrated are already well understood in the AI community. 

Like Garraghan, they argued that these issues are a general characteristic of LLMs, not unique to Doctronic’s systems. They also pointed out that Mingard wasn’t even testing the specific AI model being deployed in the Utah pilot.

“The Utah model is structurally different from what was tested. Medications are pulled from the patient’s medical records. The AI can only renew what has already been prescribed. Dosage and other checks run against external clinical databases. Anomalous behavior auto-escalates to a human physician,” Pavelle said.

So, if Mingard had attempted similar prompts on the actual model they had claimed to be testing, they would be rejected, he declared. Garraghan, of Mindgard, responded by saying his organization “would not be able to prove or disprove the existence of another instance of the chatbot.”

Pavelle stressed that Mindgard’s findings reflected the limits of a single-session experiment rather than any real-world risk in the deployed Utah model.

“[Mindgard] demonstrated that a chatbot can be prompted to generate unsafe text. Importantly, that was during a single session — which is a known property of how large language models work under adversarial prompting. But that text doesn’t authorize a prescription. That text didn’t change the way the system actually functions for any other users,” Pavelle stated.

He also noted that the Utah pilot prohibits the bot from authorizing any new prescriptions, renewing prescriptions for controlled substances or making modifications to the treatment plan.

If Pavelle is to be believed, this means one of the most controversial and concerning findings from Mindgard’s report — the fact that Doctronic’s AI bot said it would improperly increase an Oxycontin dose after manipulative prompting — amounts to little practical concern. Increasing a medication dose would never be allowed within the safety framework that Doctronic set up with the state of Utah, Pavelle remarked.

The pilot also uses a strict formulary — a predefined list of 190 medications that Doctronic’s AI is allowed to manage — which prevents the system from renewing drugs outside that list or altering dosages, he pointed out. 

“It is absolutely impossible for the chatbot to change the rest of the code to modify a prescription or prescribe a drug that’s not in our formulary. A researcher might convince the chatbot to say it will do it, because I can convince a chatbot to say that red is green, but it’s not actually doing it,” Pavelle declared. “I suppose that you never know, as far as people trying to get [improper doses of drugs on the formulary], but I don’t know that there’s a large black market for statins.”

Utah’s prescription refill bot also cannot verify whether a patient has actually been prescribed a medication, he added. Instead, it checks the state prescription database to confirm prior prescriptions before allowing a refill. In Pavelle’s view, the bot’s safeguards go beyond what most human doctors do, including real-time drug-drug interaction checks through First Databank

AI with oversight

Dr. Oskowitz emphasized that although he and Pavelle see Mindgard’s report as posing no real risk to patients, Doctronic still treats this type of research seriously. With autonomous AI being such a novel addition to clinical care, he thinks startups must work hard to ensure patients are more comfortable with these systems.

He highlighted Doctronic’s “guardian” system, an additional AI layer that monitors conversations in real time to detect risky behavior or medical emergencies and can intervene if something seems unsafe. 

Furthermore, Doctronic’s AI is constrained to medical guidance grounded in evidence-based guidelines, which limits the risk of misinformation for normal users who aren’t trying to purposely mislead the system, Dr. Oskowitz added. He said these guidelines were written by Doctronic’s physicians specifically for the company’s AI models to use.

He also pointed out that safety measures must be balanced with the real-world risk of patients missing critical medications. 

“There are real problems out there. People die every year because they can’t get their medications,” Dr. Oskowitz remarked.

There are about 125,000 preventable deaths in the U.S. every year due to medication nonadherence. A lot of this has to do with medication unaffordability, but a significant chunk is simply because of too much friction in the system — a problem that the Utah pilot seeks to address, Dr. Oskowitz explained.

The Utah Office of AI Policy shares Doctronic’s outlook on the situation.

“We understand why reports like this raise questions, and we take them seriously. Independent red-teaming can surface cases that are not encountered in ordinary use, and that kind of stress-testing is valuable as these systems mature,” read a statement emailed to MedCityNews.

The office also said it was aware of these types of risks before the pilot began. That’s why it structured this program with layered safeguards, escalation pathways, reporting requirements, physician oversight and physician review phases. It’s important to note that these physicians are Doctronic’s employees. 

Balance of innovation and caution

One of these full-time employees — Dr. Thomas Savage, an internal medicine physician who has worked at the company for seven months — said he and other Doctronic physicians have been closely reviewing the outputs of every patient interaction to make sure the system is working as intended. He added that his team of physicians is working “in lockstep with Utah.”

Doctronic and Utah are continuing to gather data before determining whether the pilot can be considered a success, but still, Dr. Savage said he believes refill bots and similar automation tools could help address real clinical challenges when deployed safely.

“There are a lot of tasks that physicians do, or healthcare providers in general, where we just need to find the contained box that is appropriate for using these technologies to help with clinical care. And that’s part of what we’re doing with Utah,” he remarked.

For clinicians, there are many tasks that are very simple yet very tedious and repetitive, like refilling prescriptions, reviewing lab results, responding to patient portal messages and completing prior authorization paperwork. As more tools are introduced to handle these tasks independently, the goal is not to replace physicians — but to automate narrowly defined administrative tasks that follow clear rules. 

For Doctronic and Utah, prescription refills for stable patients seemed like a good place to start. It’s a task that often creates delays for patients but requires little clinical judgment when strict guidelines are put in place, Dr. Savage explained.

All things considered, Mindgard’s report does seem to raise a relevant policy question. It’s not whether edge cases exist — they do, across all large language models — but whether tech developers, providers and regulators are exercising the necessary diligence as they venture into uncharted territory: medication refills without a human in the loop.

Doctronic and Utah’s Office of AI Policy say that for their refill bot pilot, their answer is yes. They think that they are striking the right balance of innovation and safety with strict protocols, physician oversight and continual monitoring.

Both organizations maintain that the use of this bot does not put patients in harm’s way. And until real-world evidence shows otherwise, they see no reason to slow the rollout.

Photo: Irina_Strelnikova, Getty Images