Medical Large Language Models: They Passed the Exam — But Did They Win the Clinic?
Medical large language models now exceed expert level on exam benchmarks, yet the measured benefit of real physician-AI collaboration is far more modest and inconsistent than the headlines suggest. We read the current evidence honestly.
A few years ago, the debate over medical artificial intelligence revolved around one simple question: can a language model pass the United States Medical Licensing Examination (USMLE)? That question is now firmly behind us. The latest generation of models not only passes these exams but routinely outscores the average physician. The genuinely interesting question has shifted: does a model that shines on a test perform the same way when it sits across from a real patient, or works alongside a clinician? The most important evidence from 2025-2026 suggests these two worlds are far more different than we assumed.
This article offers a hype-free account of where large language models (LLMs) stand in medicine today: what they have truly achieved, which claims remain unproven, and what pitfalls await us in clinical practice. We place conflicting findings side by side rather than declaring any single source the definitive truth.
A new generation on the benchmarks: GPT-5, Med-Gemini and open models
Progress on standardized exams has been striking. Google's Med-Gemini reached 91.1% accuracy on the USMLE-style MedQA test using an uncertainty-guided web-search strategy, surpassing the previous best (Med-PaLM 2). Yet the same team added a crucial note of honesty: when they re-labelled the MedQA questions, they found 7.4% of them unsuitable for evaluation — items with missing information or multiple defensible answers. In other words, part of the gap between 91% and 100% reflects the limits of the test, not the model.
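To make the effect of flawed items concrete, here is a rough sketch of the arithmetic. It assumes, which the text above does not state, that the 91.1% figure was measured on the full question set and that the 7.4% of flagged items are simply set aside. The function name and the bounding logic are illustrative only, not the Med-Gemini team's analysis.

```python
def clean_accuracy_bounds(reported_acc: float, flawed_frac: float) -> tuple[float, float]:
    """Bound the accuracy on non-flawed items, given accuracy on the full set.

    Lower bound: every flawed item was scored as correct, so all of the
    model's errors fall on clean items.
    Upper bound: every flawed item was scored as wrong, so all of the
    model's correct answers fall on clean items.
    """
    clean_frac = 1.0 - flawed_frac
    # Share of the full set that is both clean and answered correctly.
    fewest_clean_correct = max(0.0, reported_acc - flawed_frac)
    most_clean_correct = min(clean_frac, reported_acc)
    return fewest_clean_correct / clean_frac, most_clean_correct / clean_frac


low, high = clean_accuracy_bounds(reported_acc=0.911, flawed_frac=0.074)
print(f"accuracy on clean items lies between {low:.1%} and {high:.1%}")
```

Under these assumptions the score on well-formed questions could sit anywhere from about 90.4% to about 98.4%, which is why a headline figure should be read together with the quality of the test itself.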
On the open-source side, Google's MedGemma family, released in 2025, stands out; its 27-billion-parameter text model scored 87.7% on MedQA. A more concrete finding: in an unblinded evaluation, 81% of the chest X-ray reports generated by MedGemma-4B were judged by a US board-certified radiologist to be "accurate enough to lead to similar patient management."
At OpenAI, the focus moved from old USMLE-style tests toward realistic scenarios. HealthBench — an evaluation of 5,000 physician-curated scenarios across 26 specialties and 49 languages — embodies this shift. On its hardest subset (HealthBench Hard), GPT-5's thinking mode scored 46.2%, a marked jump from OpenAI o3's 31.6%. OpenAI further reported that hallucinations in challenging conversations fell roughly 8-fold versus o3, and errors in emergency scenarios dropped more than 50-fold compared with GPT-4o.
A new paradigm: conversational diagnostic AI
Answering a static question is very different from talking to a patient and taking a history. Google's AMIE system targets this second capability. In a randomized, double-blind study published in Nature in 2025, AMIE was compared against 20 primary care physicians across 159 case scenarios; it performed at least as well as, or better than, physicians on 30 of 32 axes rated by specialists and 25 of 26 axes rated by patient actors, with higher diagnostic accuracy.
The cardiology study published in Nature Medicine in early 2026 used a more realistic setup: 107 real cases of suspected hereditary cardiomyopathy were presented to 9 general cardiologists alongside raw data such as ECG, echocardiography and cardiac MRI. Subspecialist reviewers preferred AMIE-assisted assessments 46.7% of the time versus 32.7% for cardiologist-only assessments (P=0.02). This is early but encouraging evidence that the model can support — not replace — the specialist.
The critical question: is physician + AI really better?
So far the picture looks bright. But the single most important finding of 2025-2026 unsettles that expectation at exactly this point. A systematic review and meta-analysis pooling 10 randomized trials, published in npj Digital Medicine in 2026, reaches a striking conclusion: human-plus-AI configurations do not universally outperform a strong AI agent working alone. The pooled effect on diagnostic accuracy was not statistically significant. Composite quality scores showed a small improvement, but the prediction interval was so wide that it did not exclude the possibility of real-world harm. A broader analysis of 106 studies found that, on average, human-AI teams performed worse than the best single agent, whether human or AI.
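The distinction between a confidence interval and a prediction interval is easy to miss, so a small sketch may help. The code below pools made-up effect sizes with a DerSimonian-Laird random-effects model, a common choice that is assumed here and not necessarily the method used in the review. It uses a normal approximation for both intervals, whereas the usual prediction interval uses a t distribution with k-2 degrees of freedom and would be somewhat wider. All numbers are hypothetical.

```python
import math
from statistics import NormalDist


def random_effects_summary(effects, std_errors, level=0.95):
    """DerSimonian-Laird pooling; returns (pooled, conf_interval, pred_interval).

    Needs at least two studies. Intervals use a normal approximation.
    """
    k = len(effects)
    w = [1.0 / se**2 for se in std_errors]  # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance
    w_star = [1.0 / (se**2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Confidence interval: uncertainty about the average effect.
    ci = (pooled - z * se_pooled, pooled + z * se_pooled)
    # Prediction interval: plausible effect in a single new setting.
    half_width = z * math.sqrt(tau2 + se_pooled**2)
    pi = (pooled - half_width, pooled + half_width)
    return pooled, ci, pi


# Hypothetical standardized effects from ten trials (not data from any study).
effects = [0.30, -0.10, 0.45, 0.05, 0.25, -0.05, 0.40, 0.10, 0.35, 0.00]
std_errors = [0.08] * 10
pooled, ci, pi = random_effects_summary(effects, std_errors)
print(f"pooled effect {pooled:.3f}")
print(f"95% confidence interval ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% prediction interval ({pi[0]:.3f}, {pi[1]:.3f})")
```

With these invented inputs the confidence interval stays above zero while the prediction interval extends below it: the average effect looks positive, yet an individual deployment could plausibly see no benefit or a negative one.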
THE COLLABORATION PARADOX
An AI alone may be excellent; a physician alone is competent; yet combining the two does not automatically yield the best result. The problem usually lies not in the model but in how the human-AI interaction is designed. In Goh and colleagues' randomized trial in JAMA Network Open, diagnostic reasoning scores did not differ significantly between physicians with GPT-4 access (76%) and the control group (74%) (P=0.60), while GPT-4 on its own scored higher than both physician groups.
Good news from the real world: Penda Health
There is an encouraging side to the picture, and this time it comes from the field rather than from simulation. The "AI Consult" study run by Penda Health clinics with OpenAI in Kenya covered 39,849 patient visits across 15 Nairobi clinics from January to April 2025. In this setup, where the AI ran quietly in the background as an assistant (with green/yellow/red alerts), AI-supported visits saw diagnostic errors fall by 16%, treatment errors by 13%, and history-taking errors by 32%. The difference lay not in the model replacing the clinician, but in alerting them at critical moments without disrupting their workflow. It should be noted, however, that this study was published as an institutional report and preprint rather than in a peer-reviewed journal.
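The study's actual alert logic is not described here, so the following is a purely hypothetical sketch of the design idea: a background assistant that grades its findings by colour and interrupts the clinician only for the most serious ones. The class names, the example findings, and the rule that only red alerts interrupt are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    GREEN = 0   # nothing to flag
    YELLOW = 1  # advisory, shown passively
    RED = 2     # potentially safety-critical


@dataclass
class Finding:
    message: str
    severity: Severity


def triage(findings: list[Finding]) -> tuple[Severity, bool]:
    """Return the overall alert colour and whether to interrupt the clinician."""
    overall = max(
        (f.severity for f in findings),
        key=lambda s: s.value,
        default=Severity.GREEN,  # no findings means a green visit
    )
    # Assumption: only red alerts break into the workflow.
    interrupt = overall is Severity.RED
    return overall, interrupt


# Hypothetical findings for a single visit.
visit = [
    Finding("History lacks symptom duration", Severity.YELLOW),
    Finding("Prescribed dose exceeds a configured limit", Severity.RED),
]
overall, interrupt = triage(visit)
print(overall.name, interrupt)
```

The point of the sketch is the asymmetry: most output stays passive, and the clinician's attention is claimed only when the highest severity is reached.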
Clinical documentation: the fastest-spreading use case
The quietest yet most widespread clinical application of LLMs has been "ambient" assistants that automatically turn a patient encounter into a note. By 2024, 31.5% of US hospitals had begun using generative AI integrated into the electronic health record, with more than 100 tools on the market (such as Abridge, DAX Copilot, Suki and Nabla). One validation study measured a 1.47% hallucination rate and a 3.45% omission rate in note generation by these tools. A notable regulatory gap remains: none of these tools is an FDA-cleared medical device; they operate under a clinical decision support exemption.
Limits and an honest frame
For clinical use, the findings that glowing headlines overshadow matter at least as much as the successes:
Exam score ≠ clinical competence. In the CRAFT-MD evaluation, GPT-4's accuracy was 82% on ready-made case descriptions but fell to 62.7% when it had to conduct a dynamic simulated conversation with a patient. The model weakened markedly once it had to gather information itself.
LLMs can still lag behind expert physicians. In a meta-analysis published in JMIR Medical Informatics covering 30 studies and 4,762 cases, clinical professionals generally outperformed LLMs; however, heterogeneity was very high (I²=77%) and two-thirds of the studies carried a high risk of bias, which makes any firm verdict difficult.
De-skilling is a concrete risk. A multicentre observational study published in The Lancet Gastroenterology & Hepatology in 2025 found that, among endoscopists regularly exposed to AI-assisted colonoscopy, the adenoma detection rate without AI fell from 28.4% to 22.4% (a drop of 6.0 percentage points, or about a 21% relative decline). This is a measured cost of over-reliance on automation.
Hallucination is not solved. Although it varies by task, factual error rates can rise to 26-36% on some documentation tasks; on dedicated benchmarks such as MedHallu, even the best models struggle to detect "hard" medical hallucinations.
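Several of the figures above are before-and-after pairs, and it helps to keep absolute and relative change apart when reading them. The short sketch below recomputes both for the CRAFT-MD and colonoscopy numbers quoted in this section; the helper name is illustrative.

```python
def change(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


figures = {
    "CRAFT-MD, GPT-4 accuracy (static case -> dialogue)": (82.0, 62.7),
    "Adenoma detection rate without AI": (28.4, 22.4),
}
for label, (before, after) in figures.items():
    absolute, relative = change(before, after)
    print(f"{label}: {absolute:+.1f} points, {relative:+.1f}% relative")
```

This gives roughly -19.3 points (-23.5% relative) for GPT-4 in CRAFT-MD and -6.0 points (-21.1% relative) for adenoma detection. Two declines of very different absolute size are nearly the same in relative terms.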
| Finding | Measure | Source / Study type |
|---|---|---|
| Med-Gemini, MedQA accuracy | 91.1% (7.4% of items had label issues) | arXiv, product report |
| GPT-5, HealthBench Hard | 46.2% (o3: 31.6%) | OpenAI benchmark |
| AMIE preferred in cardiology | 46.7% vs 32.7% (P=0.02) | Nature Medicine, RCT (107 cases) |
| Physician + GPT-4 vs physician alone | 76% vs 74% (P=0.60; no difference) | JAMA Netw Open, RCT |
| GPT-4: static case → dynamic dialogue | 82% → 62.7% | CRAFT-MD evaluation |
| Penda Health, diagnostic error reduction | 16% (treatment 13%, history 32%) | Real world, 39,849 visits |
| De-skilling in colonoscopy | Adenoma detection 28.4% → 22.4% | Lancet Gastro, observational |
Regulation and governance
The regulatory framework is taking shape rapidly. The FDA's list of AI-enabled medical devices grew from 950 in August 2024 to more than 1,250 by July 2025. In 2025, Aidoc's CARE1 model became the first foundation-model-based clinical AI to receive FDA authorization. In January 2024, the World Health Organization issued ethics and governance guidance for large multi-modal models containing more than 40 recommendations; the European Union's AI Act places most medical applications in the "high-risk" category. Even so, standards specific to generative AI — especially in an environment where models are continuously updated — are still maturing.
Conclusion
For medical large language models, the 2025-2026 period marked a transition from the "superhuman doctor" narrative toward a more mature and more humble evidence base. Real achievements exist: models reached expert level on standard exams and controlled vignettes (Med-Gemini at 91.1%, AMIE matching or exceeding physicians on the large majority of axes); measurable error reduction was demonstrated in real deployments such as Penda Health; and the documentation burden eased. But what remains unproven is just as instructive: an exam score does not guarantee clinical competence, combining a physician with AI does not automatically yield a better outcome, hallucination and de-skilling are measured real risks, and most of the evidence still comes from "near-clinical" simulation rather than the actual bedside.
The practical takeaway is clear: today, LLMs deliver their greatest value as tools that support the physician rather than replace them — running quietly in the background, with their output always verified by an expert. The central engineering question in this field is no longer "how smart is the model?" but "how can humans and AI work together safely?" The answer lies not in the model's parameter count, but in the design of the clinical workflow and its oversight.
References
- Bedi S, et al. Human–large language model collaboration in clinical medicine: a systematic review and meta-analysis. npj Digital Medicine. 2026.
- Goh E, et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Network Open. 2024;7(10):e2440969.
- McDuff D, et al. Towards conversational diagnostic artificial intelligence (AMIE). Nature. 2025.
- Google DeepMind. A large language model for complex cardiology care (AMIE). Nature Medicine. 2026.
- Comparative meta-analysis. Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models. JMIR Medical Informatics. 2025.
- Saab K, et al. Capabilities of Gemini Models in Medicine (Med-Gemini). arXiv 2404.18416. 2024.
- Google Research. MedGemma: our most capable open models for health AI development. Google Research Blog. 2025.
- OpenAI. Introducing HealthBench (GPT-5 HealthBench Hard). OpenAI. 2025.
- Korom R, et al. AI-based Clinical Decision Support for Primary Care: A Real-World Study (Penda Health). arXiv 2507.16947. 2025.
- Budzyń K, et al. Endoscopist deskilling risk after exposure to artificial intelligence in colonoscopy. The Lancet Gastroenterology & Hepatology. 2025.
- World Health Organization (WHO). AI ethics and governance guidance for large multi-modal models. WHO News. 2024.
- Pandya A, et al. MedHallu: A Benchmark for Detecting Medical Hallucinations in LLMs. arXiv 2502.14302. 2025.