Medical Large Language Models: They Passed the Exam — But Did They Win the Clinic?

Creator: Cem Akaltun, MD
Published: 2026-05-26

Medical large language models now exceed expert level on exam benchmarks, yet the measured benefit of real physician-AI collaboration is far more modest and inconsistent than the headlines suggest. We read the current evidence honestly.

By Cem Akaltun, MD · Updated June 9, 2026 · ~12 min read · Clinical AI & LLMs

A few years ago, the debate over medical artificial intelligence revolved around one simple question: can a language model pass the United States Medical Licensing Examination (USMLE)? That question is now firmly behind us. The latest generation of models not only passes these exams but routinely outscores the average physician. The genuinely interesting question has shifted: does a model that shines on a test perform the same way when it sits across from a real patient, or works alongside a clinician? The most important evidence from 2025-2026 suggests these two worlds are far more different than we assumed.

This article offers a hype-free account of where large language models (LLMs) stand in medicine today: what they have truly achieved, which claims remain unproven, and what pitfalls await us in clinical practice. We place conflicting findings side by side rather than declaring any single source the definitive truth.

A new generation on the benchmarks: GPT-5, Med-Gemini and open models

Progress on standardized exams has been striking. Google's Med-Gemini reached 91.1% accuracy on the USMLE-style MedQA test using an uncertainty-guided web-search strategy, surpassing Google's own previous best, Med-PaLM 2. Yet the same team added a crucial note of honesty: when they re-labelled the MedQA questions, they found 7.4% of them unsuitable for evaluation because of missing information or multiple defensible answers. In other words, part of the gap between 91% and 100% reflects the limits of the test, not the model.
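How much of the remaining gap could be attributed to the test can be bounded with simple arithmetic. The sketch below is an illustration only, not a calculation taken from the Med-Gemini report: it assumes the flawed items form a fixed 7.4% subset and that we do not know how many of them the model was credited for, so it computes the best and worst case for accuracy on the remaining clean items. The function name is hypothetical.

```python
def clean_accuracy_bounds(overall_acc: float, flawed_frac: float) -> tuple[float, float]:
    """Bound accuracy on the clean items when the score on flawed items is unknown.

    overall_acc: accuracy reported on the full test set (0..1).
    flawed_frac: fraction of items judged unsuitable for evaluation (0..1).
    """
    clean_frac = 1.0 - flawed_frac
    # Worst case for the clean subset: every flawed item was scored as correct,
    # so all of the model's misses fall on clean items.
    lower = max(0.0, (overall_acc - flawed_frac) / clean_frac)
    # Best case: every flawed item was scored as wrong,
    # so all credited answers come from clean items.
    upper = min(1.0, overall_acc / clean_frac)
    return lower, upper


low, high = clean_accuracy_bounds(overall_acc=0.911, flawed_frac=0.074)
print(f"Accuracy on clean items lies between {low:.1%} and {high:.1%}")
```

Under these assumptions the accuracy on well-formed questions falls somewhere between about 90.4% and 98.4%. The width of that range is the point: a headline score cannot be read precisely when a measurable share of the test items is itself unreliable.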

On the open-source side, Google's MedGemma family, released in 2025, stands out; its 27-billion-parameter text model scored 87.7% on MedQA. A more concrete finding: in an unblinded evaluation, 81% of the chest X-ray reports generated by MedGemma-4B were judged by a US board-certified radiologist to be "accurate enough to lead to similar patient management."

At OpenAI, the focus moved from old USMLE-style tests toward realistic scenarios. HealthBench — an evaluation of 5,000 physician-curated scenarios across 26 specialties and 49 languages — embodies this shift. On its hardest subset (HealthBench Hard), GPT-5's thinking mode scored 46.2%, a marked jump from OpenAI o3's 31.6%. OpenAI further reported that hallucinations in challenging conversations fell roughly 8-fold versus o3, and errors in emergency scenarios dropped more than 50-fold compared with GPT-4o.

A new paradigm: conversational diagnostic AI

Answering a static question is very different from talking to a patient and taking a history. Google's AMIE system targets this second capability. In a randomized, double-blind study published in Nature in 2025, AMIE was compared against 20 primary care physicians across 159 case scenarios; it performed at least as well as, or better than, physicians on 30 of 32 axes rated by specialists and 25 of 26 axes rated by patient actors, with higher diagnostic accuracy.

The cardiology study published in Nature Medicine in early 2026 used a more realistic setup: 107 real cases of suspected hereditary cardiomyopathy were presented to 9 general cardiologists alongside raw data such as ECG, echocardiography and cardiac MRI. Subspecialist reviewers preferred AMIE-assisted assessments 46.7% of the time versus 32.7% for cardiologist-only assessments (P=0.02). This is early but encouraging evidence that the model can support — not replace — the specialist.

The critical question: is physician + AI really better?

So far the picture looks bright. But the single most important finding of 2025-2026 unsettles that expectation precisely here. A systematic review and meta-analysis pooling 10 randomized trials, published in npj Digital Medicine in 2026, reaches a striking conclusion: human-plus-AI configurations do not universally outperform a strong AI agent working alone. The pooled effect on diagnostic accuracy was not significant; while composite quality scores showed a small improvement, the prediction interval was so wide that it did not even exclude the possibility of real-world harm. Indeed, a broader analysis of 106 studies found that, on average, human-AI teams performed worse than the best single agent.
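Why a pooled improvement can coexist with a prediction interval that includes harm is easier to see with numbers. The following is a minimal sketch of standard DerSimonian-Laird random-effects pooling with a Higgins-style prediction interval. It is not the review's own analysis code, and the ten effect sizes are invented for illustration; only the count of ten trials comes from the text. The t critical value is passed in by hand so the example needs no third-party library.

```python
import math


def random_effects_summary(effects, variances, t_crit):
    """DerSimonian-Laird random-effects pooling with a prediction interval.

    effects:   per-trial effect estimates (e.g. standardized mean differences).
    variances: per-trial sampling variances (squared standard errors).
    t_crit:    two-sided 95% critical value of Student's t with k - 2 degrees
               of freedom, supplied by the caller to keep the sketch stdlib-only.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

    # Cochran's Q and the DerSimonian-Laird estimate of between-trial variance.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Re-weight each trial by its total variance (sampling + between-trial).
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))

    ci_half = 1.96 * se                           # uncertainty about the mean effect
    pi_half = t_crit * math.sqrt(tau2 + se ** 2)  # spread of effects in a new setting
    return {
        "pooled": pooled,
        "tau2": tau2,
        "ci": (pooled - ci_half, pooled + ci_half),
        "pi": (pooled - pi_half, pooled + pi_half),
    }


# HYPOTHETICAL data: ten made-up trial effects with equal precision (SE = 0.1).
effects = [0.30, -0.10, 0.25, 0.05, 0.40, -0.20, 0.15, 0.35, 0.00, 0.10]
variances = [0.01] * 10
result = random_effects_summary(effects, variances, t_crit=2.306)  # t(0.975, df=8)
print(result)
```

With these invented inputs the confidence interval for the average effect stays just above zero, while the prediction interval extends well below it. The confidence interval describes the mean across trials; the prediction interval describes what a single new deployment might see, and between-trial variance widens it.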

THE COLLABORATION PARADOX

An AI alone may be excellent; a physician alone is competent; yet combining the two does not automatically yield the best result. The problem usually lies not in the model but in how the human-AI interaction is designed. In Goh and colleagues' randomized trial in JAMA Network Open, there was no significant difference between physicians with GPT-4 access (76%) and the control group (74%) (P=0.60), while GPT-4 on its own scored higher than both physician groups.

Good news from the real world: Penda Health

There is an encouraging side to the picture, and this time it comes from the field rather than from simulation. The "AI Consult" study run by Penda Health clinics with OpenAI in Kenya covered 39,849 patient visits across 15 Nairobi clinics from January to April 2025. In this setup, where the AI ran quietly in the background as an assistant (with green/yellow/red alerts), AI-supported visits saw diagnostic errors fall by 16%, treatment errors by 13%, and history-taking errors by 32%. The difference lay not in the model replacing the clinician, but in alerting them at critical moments without disrupting their workflow. It should be noted, however, that this study was published as an institutional report and preprint rather than in a peer-reviewed journal.
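The background, traffic-light pattern described here can be sketched in a few lines. This is a hypothetical illustration, not Penda Health's or OpenAI's implementation: the `Finding` type, the severity scale and the thresholds are all assumptions made for the example. What it shows is the design choice that only the highest tier interrupts the clinician.

```python
from dataclasses import dataclass
from enum import Enum


class Alert(Enum):
    GREEN = "green"    # no concern: nothing needs attention
    YELLOW = "yellow"  # advisory: visible, clinician may ignore it
    RED = "red"        # safety-critical: clinician must look at it


@dataclass
class Finding:
    """A hypothetical issue flagged by a background model review."""
    description: str
    severity: float  # assumed scale: 0.0 (trivial) to 1.0 (safety-critical)


def triage(findings, yellow_at=0.3, red_at=0.7):
    """Map the most severe finding to one alert tier (thresholds are illustrative)."""
    worst = max((f.severity for f in findings), default=0.0)
    if worst >= red_at:
        return Alert.RED
    if worst >= yellow_at:
        return Alert.YELLOW
    return Alert.GREEN


def interrupts_workflow(alert):
    """Only a red alert blocks the visit; green and yellow stay in the background."""
    return alert is Alert.RED


visit = [Finding("No allergy history recorded", severity=0.4)]
level = triage(visit)
print(level.value, interrupts_workflow(level))
```

In this sketch an incomplete history produces a yellow advisory that the clinician can see but is not forced to act on, which is one way to keep the assistant from disrupting routine visits.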

Clinical documentation: the fastest-spreading use case

The quietest yet most widespread clinical application of LLMs has been "ambient" assistants that automatically turn a patient encounter into a note. By 2024, 31.5% of US hospitals had begun using generative AI integrated into the electronic health record, with more than 100 tools on the market (such as Abridge, DAX Copilot, Suki and Nabla). One validation study measured a 1.47% hallucination rate and a 3.45% omission rate in note generation by these tools. A notable regulatory gap remains: none of these tools is an FDA-cleared medical device; they operate under a clinical decision support exemption.

Limits and an honest frame

The findings overshadowed by glowing headlines matter, for clinical use, at least as much as the successes:

Exam score ≠ clinical competence. In the CRAFT-MD evaluation, GPT-4's accuracy was 82% on ready-made case descriptions but fell to 62.7% when it had to conduct a dynamic simulated conversation with a patient. The model weakened markedly once it had to gather information itself.

LLMs can still lag behind expert physicians. In a meta-analysis published in JMIR Medical Informatics covering 30 studies and 4,762 cases, clinical professionals generally outperformed LLMs; however, heterogeneity was very high (I²=77%) and two-thirds of the studies carried a high risk of bias, which makes any firm verdict difficult.

De-skilling is a concrete risk. A multicentre observational study published in The Lancet Gastroenterology & Hepatology in 2025 found that, among endoscopists regularly exposed to AI-assisted colonoscopy, the adenoma detection rate without AI fell from 28.4% to 22.4% (a 6.0-point absolute drop, about 21% in relative terms). This is a measured cost of over-reliance on automation.
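Absolute and relative declines are easy to confuse, so the arithmetic behind the adenoma detection figures is worth spelling out. The helper below is a small sketch with a hypothetical function name; the two input rates are the ones quoted in the text.

```python
def decline(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct
    return absolute, relative


abs_drop, rel_drop = decline(before_pct=28.4, after_pct=22.4)
print(f"{abs_drop:.1f} percentage points, {rel_drop:.1%} relative decline")
```

The result is a drop of 6.0 percentage points, which is roughly one fifth of the starting rate. Reporting both figures avoids reading a 6-point change as small or a 21% change as an absolute loss.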

Hallucination is not solved. Although it varies by task, factual error rates can rise to 26-36% on some documentation tasks; on dedicated benchmarks such as MedHallu, even the best models struggle to detect "hard" medical hallucinations.

Finding	Measure	Source / Study type
Med-Gemini, MedQA accuracy	91.1% (7.4% of items had label issues)	arXiv, product report
GPT-5, HealthBench Hard	46.2% (o3: 31.6%)	OpenAI benchmark
AMIE preferred in cardiology	46.7% vs 32.7% (P=0.02)	Nature Medicine, RCT (107 cases)
Physician + GPT-4 vs physician alone	76% vs 74% (P=0.60; no difference)	JAMA Netw Open, RCT
GPT-4: static case → dynamic dialogue	82% → 62.7%	CRAFT-MD evaluation
Penda Health, diagnostic error reduction	16% (treatment 13%, history 32%)	Real world, 39,849 visits
De-skilling in colonoscopy	Adenoma detection 28.4% → 22.4%	Lancet Gastro, observational

Regulation and governance

The regulatory framework is taking shape rapidly. The FDA's list of AI-enabled medical devices grew from 950 in August 2024 to more than 1,250 by July 2025. In 2025, Aidoc's CARE1 model became the first foundation-model-based clinical AI to receive FDA authorization. In January 2024, the World Health Organization issued ethics and governance guidance for large multi-modal models containing more than 40 recommendations; the European Union's AI Act places most medical applications in the "high-risk" category. Even so, standards specific to generative AI — especially in an environment where models are continuously updated — are still maturing.

Conclusion

For medical large language models, the 2025-2026 period marked a transition from the "superhuman doctor" narrative toward a more mature and more humble evidence base. Real achievements exist: models reached expert level on standard exams and controlled vignettes (Med-Gemini at 91.1%, AMIE matching or exceeding physicians on the large majority of axes); measurable error reduction was demonstrated in real deployments such as Penda Health; and ambient documentation tools spread rapidly through hospitals. But what remains unproven is just as instructive: an exam score does not guarantee clinical competence, combining a physician with AI does not automatically yield a better outcome, hallucination and de-skilling are measured real risks, and most of the evidence still comes from "near-clinical" simulation rather than the actual bedside.

The practical takeaway is clear: today, LLMs deliver their greatest value as tools that support the physician rather than replace them — running quietly in the background, with their output always verified by an expert. The central engineering question in this field is no longer "how smart is the model?" but "how can humans and AI work together safely?" The answer lies not in the model's parameter count, but in the design of the clinical workflow and its oversight.

References

Bedi S, et al. Human–large language model collaboration in clinical medicine: a systematic review and meta-analysis. npj Digital Medicine. 2026.
Goh E, et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Network Open. 2024;7(10):e2440969.
McDuff D, et al. Towards conversational diagnostic artificial intelligence (AMIE). Nature. 2025.
Google DeepMind. A large language model for complex cardiology care (AMIE). Nature Medicine. 2026.
Comparative meta-analysis. Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models. JMIR Medical Informatics. 2025.
Saab K, et al. Capabilities of Gemini Models in Medicine (Med-Gemini). arXiv 2404.18416. 2024.
Google Research. MedGemma: our most capable open models for health AI development. Google Research Blog. 2025.
OpenAI. Introducing HealthBench (GPT-5 HealthBench Hard). OpenAI. 2025.
Korom R, et al. AI-based Clinical Decision Support for Primary Care: A Real-World Study (Penda Health). arXiv 2507.16947. 2025.
Budzyń K, et al. Endoscopist deskilling risk after exposure to artificial intelligence in colonoscopy. The Lancet Gastroenterology & Hepatology. 2025.
World Health Organization (WHO). AI ethics and governance guidance for large multi-modal models. WHO News. 2024.
Pandya A, et al. MedHallu: A Benchmark for Detecting Medical Hallucinations in LLMs. arXiv 2502.14302. 2025.

Disclaimer: This content is for educational and informational purposes only and does not substitute for diagnosis or treatment decisions. Medical outputs from large language models should always be verified by a competent physician; these models should not be used as autonomous clinical decision-making tools.