
Artificial Intelligence in Clinical Decision Support: What It Has and Hasn't Achieved

AI-based clinical decision support has improved discrimination and process measures, yet prospective patient-outcome evidence, low false-alarm burden, and reliable human-AI collaboration remain the field's unresolved bottlenecks.

By Cem Akaltun, MD · Clinical AI & LLMs

Clinical decision support systems (CDSS) aim to guide clinicians in diagnosis, risk prediction, and treatment management by processing data from the electronic health record. Over the past two years, the arrival of artificial intelligence and large language models (LLMs) has expanded the field rapidly. Yet the 2025-2026 evidence paints a sobering picture. Model discrimination has improved markedly, but it remains largely open how far this translates into patient outcomes (mortality, morbidity), how manageable the alarm burden has become, and whether clinicians and AI actually form a good team. This article separates what has been achieved from what has not, anchored in the most current randomized and prospective data.

Sepsis early warning: a contradiction in the field's most-tested showcase

Sepsis prediction is the most intensively studied CDSS domain, because early intervention saves lives and electronic data are plentiful. The strongest positive evidence here is the SCREEN trial, published in 2025. This stepped-wedge cluster randomized controlled trial (RCT), conducted across 5 hospitals and 45 wards in Saudi Arabia, enrolled 60,055 patients. It tested a qSOFA-based electronic alert and found, for the primary outcome of 90-day in-hospital mortality, an adjusted relative risk (aRR) of 0.85 (95% CI 0.77-0.93; P<0.001) (Arabi et al., JAMA 2025). Lactate testing (aRR 1.30) and intravenous fluid administration (aRR 2.17) increased, while vasopressor use and multidrug-resistant organisms decreased.
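As a rough plausibility check on the reported effect, the sketch below recovers the standard error, z statistic, and two-sided p-value implied by the aRR and its 95% CI. It assumes a Wald-type interval that is symmetric on the log scale, which is an assumption of this example and not necessarily how the trial's model-based P value was computed. The function name is hypothetical.

```python
import math

def z_and_p_from_ratio_ci(estimate, lower, upper, z_crit=1.96):
    """Approximate SE, z and two-sided p implied by a ratio and its 95% CI.

    Assumption: the CI is symmetric on the log scale (Wald-type interval).
    """
    # Half the CI width on the log scale equals z_crit * SE.
    se_log = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / se_log
    # Two-sided p-value under a standard normal reference distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return se_log, z, p

# Values reported for SCREEN: aRR 0.85 (95% CI 0.77-0.93).
se_log, z, p = z_and_p_from_ratio_ci(0.85, 0.77, 0.93)
print(f"SE(log aRR) = {se_log:.4f}, z = {z:.2f}, two-sided p = {p:.4f}")
```

Under these assumptions the implied p-value falls below 0.001, which is consistent with the reported P<0.001.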

This is a landmark demonstration that electronic sepsis screening can lower mortality. But two critical nuances must not be overlooked. First, the alert in SCREEN was not a machine-learning model but a rule-based qSOFA threshold — a relatively simple decision rule rather than "AI." Second, the intervention was not harmless: code blue activations, initiation of kidney replacement therapy, and Clostridioides difficile infections all increased. More intervention does not always mean better outcomes.

When we turn to machine-learning models, the picture is more cautious. The widely deployed Epic Sepsis Model (ESM) performed poorly in its first major external validation (Wong et al., JAMA Internal Medicine 2021), with an AUC of just 0.63. An independent 2024 evaluation in two county emergency departments (Ostermayer et al., JAMIA Open) found a sensitivity of 14.7%, a PPV of 7.6%, and a median lead time of 0 minutes within a 6-hour window, meaning the alert added little clinical value.

The first large multicenter prospective validation of the model's updated v2 version, published in 2026 and covering 4 US health systems and 227,091 encounters, showed nuanced progress: institution-level AUROC rose to 0.82-0.92 (JAMA Network Open 2026). Yet the same study reported that at a 60% sensitivity threshold the positive predictive value remained only 0.13-0.26, with high cross-institutional variability and a heavy alarm burden. The authors' clear message: every institution should perform local validation and implement alert-silencing strategies. The ESM story is therefore better reframed not as a "bad model" but as "discrimination improved, yet PPV remains low and clinical benefit unproven."
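Why a model with good discrimination can still have a low PPV follows from Bayes' rule: when the condition is uncommon, even a small false-positive rate produces many false alerts. The sketch below illustrates this at the 60% sensitivity mentioned above. The specificity (0.90) and the prevalence values are hypothetical numbers chosen for illustration and are not taken from the validation study.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Sensitivity fixed at 0.60 (from the text); specificity and prevalence
# are assumptions made only for this illustration.
for prevalence in (0.02, 0.05, 0.10):
    value = ppv(sensitivity=0.60, specificity=0.90, prevalence=prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {value:.2f}")
```

With these assumed inputs, PPV rises steeply with prevalence, which is one reason the same model can behave differently across institutions and why local validation matters.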

Two sources, two outcomes — no single "truth"

The evidence on whether sepsis CDSS affects mortality must be read side by side: in the SCREEN RCT a rule-based alert lowered mortality (aRR 0.85); by contrast, the Epic ESM v1/v2 validations showed a machine-learning model with low PPV and as-yet-unproven clinical benefit. These are different model types, and the findings do not invalidate one another.

The false-alarm problem and alert fatigue

The natural consequence of low PPV is a large number of false alarms, which leads to "alert fatigue" — the state in which clinicians begin to ignore warnings. The scale of the problem is concrete: a systematic review and meta-analysis found a pooled override rate of 90% (95% CI 85-95) for drug-drug interaction alerts (Health Informatics Journal 2024). This is why next-generation sepsis systems make "minimizing false alarms" a primary design goal: the prospective COMPOSER-LLM pipeline reported only 0.0086 false alarms per patient-hour, while SepsisAI reported a 3.18% false-alarm rate in intensive care (PLOS Digital Health 2024). Reducing alarm burden is a precondition for preserving clinical trust.
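To make these alarm-burden figures more tangible, the sketch below converts a per-patient-hour false-alarm rate into expected false alarms per day on a ward, and converts a PPV into the number of alerts a clinician sees per true case. The ward census of 30 patients is a hypothetical value, and the PPV range comes from the Epic ESM v2 validation, a different system from COMPOSER-LLM, so the two calculations are independent illustrations.

```python
def false_alarms_per_day(rate_per_patient_hour, census):
    """Expected false alarms per day for a ward with a constant census."""
    return rate_per_patient_hour * 24 * census

def alerts_per_true_case(ppv):
    """Average number of alerts fired for each true-positive alert."""
    return 1 / ppv

# 0.0086 false alarms per patient-hour is the rate reported for COMPOSER-LLM;
# a census of 30 patients is an assumption for illustration.
daily = false_alarms_per_day(0.0086, census=30)
print(f"Expected false alarms per day (30 patients): {daily:.1f}")

# PPV values of 0.13 and 0.26 are those reported for Epic ESM v2.
for value in (0.13, 0.26):
    print(f"PPV {value:.2f}: about {alerts_per_true_case(value):.1f} alerts per true case")
```

Framing PPV as "alerts per true case" is a common way to communicate alert burden to clinical teams; it is used here as a convenience, not as a metric reported by the cited studies.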

The big picture: the gap between technical success and clinical benefit

Stepping beyond individual domains to the whole, a comprehensive systematic review and meta-analysis published in 2026 offers a striking summary. Evaluating 50 studies across 17 specialties, the analysis found that predictive AI-based CDSS achieved a pooled AUC of 0.652 (95% CI 0.562-0.743), with sensitivity 0.660 and specificity 0.819 (PLOS Digital Health 2026). More importantly, 76% of the studies were retrospective, and 64% reported only technical metrics without any clinical workflow data. The authors' central point is that there is a critical gap between technical validation and real-world clinical benefit.
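For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen patient with the outcome receives a higher risk score than a randomly chosen patient without it, so a value near 0.65 is only moderately better than chance (0.5). The sketch below computes AUC directly from that definition; the scores are made-up values for illustration.

```python
def auc(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical risk scores: patients with and without the outcome.
with_outcome = [0.9, 0.6, 0.4]
without_outcome = [0.7, 0.5, 0.3, 0.2]
print(f"AUC = {auc(with_outcome, without_outcome):.2f}")
```

This pairwise form is quadratic in the number of patients and is meant only to show the definition; production code would typically use a rank-based implementation.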

This gap appears in other domains too. A 2024 meta-analysis of CDSS in acute kidney injury (10 RCTs, 18,355 patients) failed to show a significant benefit for all-cause mortality or kidney replacement therapy; only hyperkalemia incidence fell and eGFR trajectory improved (Renal Failure 2024). Negative findings are every bit as instructive as positive ones.

Large language models: a paradox in diagnosis, a modest gain in management

The effect of generative AI on clinical reasoning holds some of the most intriguing and contradictory recent findings. In Goh and colleagues' diagnostic reasoning RCT (JAMA Network Open 2024), when physicians used an LLM (GPT-4) in addition to conventional resources, there was no significant difference in diagnostic performance (adjusted difference +2 points; P=0.60). The striking result: the LLM alone outperformed the physician groups by 16 points (95% CI 2-30; P=0.03), yet that advantage vanished when the model was combined with a clinician. This is the paradox known in the field as the "tool to teammate" problem, signalling that human-AI collaboration remains unsolved.

By contrast, the same group's management reasoning RCT (Nature Medicine 2025, 92 physicians) found a modest but significant gain: the LLM-assisted group was +6.5% (95% CI 2.7-10.2; P<0.001) better than conventional resources. Diagnosis and management are not the same; benefit is clearly task-dependent. The effect may be larger in resource-limited settings: in a study of 60 physicians in Pakistan, LLM support improved diagnostic performance by 27.5% (Nature Health 2025) — a markedly larger effect than in high-income-country studies.

Study / System | Design | Key outcome
SCREEN (JAMA 2025) | Cluster RCT, n=60,055 | 90-day mortality aRR 0.85; but code blue and C. diff increased
Epic ESM v2 (JAMA Netw Open 2026) | Prospective validation, n=227,091 | AUROC 0.82-0.92; PPV 0.13-0.26 (low), high alarm burden
Predictive CDSS meta-analysis (2026) | SR+meta, 50 studies | Pooled AUC 0.652; 76% retrospective
Goh, diagnosis RCT (2024) | RCT, 50 physicians | No physician benefit; LLM alone +16 points
Goh, management RCT (2025) | RCT, 92 physicians | LLM-assisted group +6.5%
Automation bias RCT (NEJM AI 2025) | RCT, 44 physicians | With erroneous advice, accuracy 84.9% → 73.3% (−14 points)

Automation bias: the most serious and least solved risk

Perhaps the most cautionary finding concerns automation bias. In a 2025 RCT (NEJM AI), 44 physicians who had completed AI literacy training were presented with 6 vignettes, 3 of which carried deliberately erroneous LLM advice. In the group exposed to faulty advice, diagnostic accuracy fell from 84.9% to 73.3% (an adjusted 14.0-point reduction); top-choice accuracy dropped from 90.5% to 76.1% (−18.3 points). The implication is unsettling: even among physicians trained to critically appraise AI, and even when consultation was discretionary, clinicians were misled when the model was confidently wrong. AI literacy helps, but it does not immunize against automation bias.

The regulatory landscape and the standards gap

The regulatory framework is maturing rapidly but unevenly. The number of authorized AI-enabled medical devices in the US rose from about 950 in August 2024 to more than 1,250 by July 2025, the vast majority in radiology. A striking gap stands out: an analysis of 1,016 authorizations found that 84.4% of devices were image-based, while only 3 devices (0.4%) were tabular/EHR-based, and none yet incorporated an LLM (npj Digital Medicine 2025). In practice, this means most EHR-based CDSS, such as sepsis scores, operate under a "clinical decision support exemption" rather than FDA clearance, in a less regulated space.

Internationally, the World Health Organization's January 2024 guidance on large multi-modal models issued more than 40 recommendations, including independent third-party audit and mandatory post-deployment monitoring. The European Union AI Act classifies medical-device AI as "high risk," with high-risk obligations taking effect in 2026. Even so, the field's biggest methodological shortfall is reporting quality: evaluations meet a median of only 3.5 of 17 DECIDE-AI criteria. Adherence to standards such as TRIPOD+AI and DECIDE-AI is critical to the field's credibility.

A technical threat that cannot be ignored: dataset shift

Finally, deployed models degrade over time. As patient demographics, laboratory methods, or coding systems change (for example, the transition from ICD-9 to ICD-10), model performance declines — a phenomenon called "dataset shift" (Journal of Biomedical Informatics 2025). This shows that CDSS are not "set and forget" systems; they require continuous monitoring and, when necessary, retraining. The fact that ESM v2's AUROC varied between 0.82 and 0.92 across institutions reflects the same reality: a model that works well in one institution is not reliable elsewhere without local validation.
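One simple way to watch for input drift is to compare the distribution of a model feature in recent data against its distribution at validation time. The sketch below uses the Population Stability Index (PSI), a general-purpose drift statistic chosen for this example and not a method taken from the cited paper. The lactate values, bin edges, and function name are hypothetical.

```python
import math

def psi(baseline, current, edges, eps=1e-6):
    """Population Stability Index between two samples over fixed bin edges."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value strictly exceeds.
            counts[sum(v > e for e in edges)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Hypothetical lactate values (mmol/L) at validation time and after a
# change that shifts measurements upward, e.g. a new laboratory assay.
baseline = [1.0, 1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 3.0]
current = [v + 0.8 for v in baseline]
edges = [1.5, 2.0, 2.5]

print(f"PSI, no shift: {psi(baseline, baseline, edges):.3f}")
print(f"PSI, shifted:  {psi(baseline, current, edges):.3f}")
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a substantial shift, but such thresholds are conventions and would need local calibration. Input drift monitoring complements, and does not replace, periodic re-evaluation of discrimination and PPV on local outcomes.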

Conclusion

The 2025-2026 evidence justifies neither blind optimism nor wholesale rejection of AI in clinical decision support. What has been achieved: AI, and even simple rule-based systems, can improve process measures (timing of lactate, fluids, antibiotics); at least one large RCT (SCREEN) showed electronic screening lowers mortality; LLMs can produce high diagnostic accuracy in isolated tests and offer a modest gain in management decisions. What has not been achieved: prospective, randomized, patient-outcome evidence remains scarce; the clinical benefit of machine-learning sepsis models is unproven (low PPV, heavy alarm burden); the human-AI collaboration paradox is unresolved; and automation bias can cut accuracy by 14 points even in trained physicians.

The practical takeaway is clear: these systems should be used alongside the clinician under careful oversight, not in place of them. Each institution should validate locally on its own data, actively manage alarm burden, monitor for model degradation, and adhere to standard reporting frameworks (TRIPOD+AI, DECIDE-AI). The field's real bottleneck is no longer higher AUC; it is prospective evidence of clinical benefit, reliable human-AI collaboration, and transparent, sustainable monitoring. AI is on its way to becoming a genuine assistant in the clinic, but that journey is not yet complete.

References

  1. Arabi YM, et al. Electronic Sepsis Screening Among Patients Admitted to Hospital Wards: A Stepped-Wedge Cluster Randomized Trial (SCREEN). JAMA. 2025;333(9):763-773.
  2. Multicenter Prospective Validation of an Updated Proprietary Sepsis Prediction Model (Epic Sepsis Model v2). JAMA Network Open. 2026;9(2):e260181.
  3. Ostermayer DG, et al. External validation of the Epic sepsis predictive model in 2 county emergency departments. JAMIA Open. 2024;7(4):ooae133.
  4. Wong A, et al. External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients. JAMA Internal Medicine. 2021;181(8):1065-1070.
  5. Waldock WJ, et al. Performance of predictive AI-based clinical decision support systems across clinical domains: a systematic review and meta-analysis. PLOS Digital Health. 2026;5(3):e0001310.
  6. Effect of clinical decision support systems on clinical outcomes in acute kidney injury: a systematic review and meta-analysis. Renal Failure. 2024.
  7. Goh E, et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Network Open. 2024;7(10):e2440969.
  8. Goh E, et al. GPT-4 assistance for improvement of physician management reasoning: a randomized clinical trial. Nature Medicine. 2025.
  9. Automation Bias in Large Language Model–Assisted Diagnostic Reasoning Among Physicians Trained in AI Literacy: A Randomized Clinical Trial. NEJM AI. 2025.
  10. Singh K, Lotter W, et al. How AI is used in FDA-authorized medical devices: a taxonomy across 1,016 authorizations. npj Digital Medicine. 2025;8:388.
  11. World Health Organization. Ethics and governance of artificial intelligence for health: guidance on large multi-modal models. WHO. 2024.
  12. Override rate of clinical decision support drug-drug interaction alerts: a systematic review and meta-analysis. Health Informatics Journal. 2024.
Disclaimer: This content is for educational and informational purposes only and does not substitute for diagnosis or treatment decisions. Clinical decision support systems are designed to support the final decision of the physician; alerts must always be interpreted in clinical context.