Large language models (LLMs) such as ChatGPT have generated significant excitement across healthcare. Their ability to summarise medical literature, answer clinical questions and perform well on structured diagnostic tasks has raised the possibility that LLMs could soon serve as clinical assistants.
It’s true that in some contexts, LLMs are already competent at aspects of this role. Research shows that models can perform strongly on medical exams, disease classification tasks and narrow diagnostic prompts. In our own peer-reviewed work, published in the Journal of the American Medical Informatics Association, we observed high accuracy in disease identification across conditions, including chronic obstructive pulmonary disease (COPD) and chronic kidney disease (CKD), as well as reasonable diagnostic accuracy when models were prompted carefully.
But clinical care does not happen in constrained settings.
The difference between a powerful language model and a true clinical assistant is not accuracy on isolated prompts; it is architecture.
From impressive demos to clinical reality
General-purpose LLMs rely on what a user provides in a prompt. In practice, this means clinicians must manually summarise a patient’s history and formulate the right question.
This dependence on prompting is not a minor usability issue. A recent University of Oxford study found that participants using chatbots for medical advice received inconsistent and sometimes misleading responses, often because they didn’t know what to ask or failed to share critical context. Outcomes varied based on phrasing, leaving users to interpret conflicting suggestions.
Clinical care, however, cannot depend on perfectly framed prompts. Relevant signals are often scattered across years of electronic health record (EHR) data, including structured fields, unstructured notes, lab trends, referrals, and treatment histories. A clinician may not know which detail matters until it is surfaced.
A clinical assistant must therefore retrieve and reason over the full patient record automatically and proactively. Clinical reasoning depends on longitudinal memory – not just access to data, but the ability to reconstruct a patient’s trajectory over time.
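To make that concrete, here is a minimal sketch in Python of what longitudinal grounding might look like. The event types and record shapes are hypothetical (a real integration would ingest FHIR resources or vendor-specific EHR schemas); the point is that the system, not the clinician, assembles the trajectory:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ClinicalEvent:
    when: date
    kind: str    # e.g. "lab", "note", "referral", "medication"
    detail: str

def build_timeline(events: list[ClinicalEvent]) -> list[ClinicalEvent]:
    """Order scattered EHR events chronologically so downstream reasoning
    sees the patient's full trajectory, not just what a prompt includes."""
    return sorted(events, key=lambda e: e.when)

# A decline in kidney function only becomes visible once labs are in sequence.
timeline = build_timeline([
    ClinicalEvent(date(2021, 3, 1), "lab", "eGFR 72 mL/min/1.73m2"),
    ClinicalEvent(date(2023, 9, 1), "lab", "eGFR 54 mL/min/1.73m2"),
    ClinicalEvent(date(2022, 6, 1), "note", "Patient reports fatigue"),
])
for event in timeline:
    print(event.when, event.kind, event.detail)
```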
Medicine runs on guidelines, not probabilities
LLMs are trained on broad medical knowledge and generate answers based on linguistic and statistical probability. But healthcare delivery requires deterministic application of specific clinical guidelines. Care pathways differ by disease, stage, biomarker status and evolving guideline recommendations. In practice, the central question is not “what might this be?” but rather “is this patient receiving guideline-concordant care?”
That requires pre-configured, disease-specific logic grounded in current clinical standards rather than generic medical reasoning compiled using language patterns.
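To illustrate the difference, here is a deliberately simplified sketch of deterministic guideline logic, loosely based on the KDIGO definition of CKD (eGFR below 60 mL/min/1.73m² sustained for at least three months). Real guideline logic also covers markers of kidney damage, staging and exceptions; the point is that the rule fires identically every time, with no linguistic probability involved:

```python
from datetime import date, timedelta

def meets_ckd_criterion(egfr_results: list[tuple[date, float]]) -> bool:
    """Simplified KDIGO-style check: eGFR < 60 mL/min/1.73m2 sustained
    for at least 3 months (two low readings at least 90 days apart)."""
    low_dates = sorted(d for d, value in egfr_results if value < 60)
    return bool(low_dates) and (low_dates[-1] - low_dates[0]) >= timedelta(days=90)

print(meets_ckd_criterion([(date(2023, 1, 5), 55.0), (date(2023, 6, 2), 52.0)]))  # True
print(meets_ckd_criterion([(date(2023, 1, 5), 55.0)]))                            # False
```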
Interpretability is equally essential. If an AI system flags a care gap, clinicians must see exactly which lab value, pathology note or referral date supports that conclusion. Transparent reasoning and traceability to source evidence must be prerequisites for trust, governance and safe clinical action, not optional features.
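As a sketch of what that traceability could look like in practice, consider a care-gap flag that carries explicit pointers to its source evidence. The identifier scheme below is hypothetical; what matters is that every conclusion is backed by a retrievable document:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    source_id: str   # hypothetical EHR document/result identifier
    excerpt: str     # the exact lab value, note sentence or referral date cited

@dataclass
class CareGapFlag:
    patient_id: str
    finding: str
    evidence: list[Evidence] = field(default_factory=list)

# A flag is only actionable when a clinician can click through to its sources.
flag = CareGapFlag(
    patient_id="12345",
    finding="Biomarker testing not documented within the guideline window",
    evidence=[
        Evidence("pathology/2023-04-18", "Invasive ductal carcinoma confirmed"),
        Evidence("orders/2023", "No biomarker assay ordered in the six months after diagnosis"),
    ],
)
```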
One patient at a time is not a health system
Conversational AI also operates case by case: one patient, one prompt, one response.
Health systems, in contrast, must identify care gaps across entire populations. They need to know which patients have missed biomarker testing, which have not received first-line therapy and where inequities are emerging. Manual prompting, one patient at a time, is therefore not enough; this requires automated assessment at scale.
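Continuing the simplified CKD rule sketched above, population-scale assessment is conceptually just that rule applied to every record on a schedule, rather than to one patient per prompt. The rule is repeated here so the sketch stands alone, and the cohort data is toy:

```python
from datetime import date, timedelta

def meets_ckd_criterion(egfrs: list[tuple[date, float]]) -> bool:
    # Same simplified KDIGO-style rule as in the earlier sketch.
    low_dates = sorted(d for d, value in egfrs if value < 60)
    return bool(low_dates) and (low_dates[-1] - low_dates[0]) >= timedelta(days=90)

def screen_population(cohort: dict[str, list[tuple[date, float]]]) -> list[str]:
    """Run the deterministic rule across every patient: one pass,
    every record assessed, no prompting required."""
    return [pid for pid, egfrs in cohort.items() if meets_ckd_criterion(egfrs)]

cohort = {
    "patient-001": [(date(2023, 1, 5), 55.0), (date(2023, 6, 2), 52.0)],
    "patient-002": [(date(2023, 2, 1), 88.0)],
}
print(screen_population(cohort))  # ['patient-001']
```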
A practical illustration can be seen in the UK’s NHS, where AI has identified cancer cachexia patients systematically missed by traditional ICD-code searches. Rather than generating suggestions, the system reasoned across years of unstructured notes and laboratory data to assess guideline concordance across populations. Crucially, its conclusions were traceable to source documentation within the EHR and delivered within routine workflows. This is clinical infrastructure, not conversational assistance.
Similar patterns have emerged across other conditions. In CKD, verified results from our platform identified at-risk individuals with 97% sensitivity and 85% precision by reasoning over longitudinal laboratory trends and clinical notes against KDIGO guidelines. In COPD, our platform detected patients at high risk of exacerbations with 99% sensitivity based on GOLD guidelines, enabling earlier intervention. In metastatic breast cancer, it identified the 51% of patients who had been missed for guideline-directed biomarker testing; of those who did receive biomarker testing, it found that 58% and 36% had been missed for first- and second-line therapies, respectively. These outcomes were driven not by conversational prompts, but by systems designed to emulate clinical reasoning at scale.
General-purpose LLMs have meaningful value in healthcare – they can support education, documentation and research, lower barriers to knowledge and help improve communication.
But acting as a clinical assistant embedded in care delivery requires more – it needs longitudinal EHR grounding, explicit encoding of clinical guidelines, transparent and traceable reasoning and the ability to operate securely at a population scale.
Recognising that these are architectural design requirements, not usability refinements, is the foundation for deploying AI safely, responsibly and effectively in real clinical environments.
Author bios:
Vibhor Gupta has two decades of experience in life sciences through his work in industry and academia. Prior to Pangaea, Vibhor started and built the European business for Quantum Secure, an enterprise software solutions provider headquartered in Silicon Valley that was acquired by a global corporation in 2015. He then served as Senior Vice President of Commercial Strategy and Sales at Seven Bridges Genomics, which was founded at Harvard and provided a cloud-based bioinformatics platform. Vibhor’s academic career focused on conducting molecular biology studies and building bioinformatics tools and machine learning models with epigenetic, genomic, transcriptomic and clinical trial data in the context of oncology and infectious diseases. He has an extensive global network in the life sciences industry and is regularly invited to speak at international conferences, government-funded programs and investment summits.
Jingqing Zhang is the Head of AI at Pangaea Data Limited. He earned his PhD in Computing from Imperial College London in 2022, under the supervision of Prof. Yi-Ke Guo. His research interests are in Language Modelling, Natural Language Processing, Text Mining, Data Mining and Deep Learning, with an emphasis on their applications in healthcare. Jingqing is the founder of, and main contributor to, popular deep learning tools such as TensorLayer 2.0 and PEGASUS. His projects have received over 10,000 stars on GitHub and millions of downloads on HuggingFace. He was awarded the National Scholarship of China (top 1%) in 2015.