AI-Driven Triage and Symptom Assessment: The 2026 Evidence Landscape

Creator: Cem Akaltun, MD
Published: 2026-05-26

AI-based symptom assessment tools can match clinicians on triage accuracy in vignette settings, yet real-world validation, over-triage, and distribution shift remain unresolved problems.

By Cem Akaltun, MD · June 9, 2026 (Updated) · ~12 min read · Clinical AI & LLMs

When a patient opens a phone app at midnight to type in chest-pain symptoms, or a physician reads an AI-generated pre-assessment report before the patient even walks in, we are witnessing one of the fastest-growing frontiers of modern medicine: AI-driven triage and digital symptom assessment. The question seems simple, but the answer is layered. Can these tools genuinely make the right urgency decision, or do they merely produce fluent yet unreliable text? Meta-analyses and real-world studies published in 2025-2026 have, for the first time, placed the answer on a quantitative and honest footing.

Two Distinct Tasks: Don't Conflate Triage and Diagnosis

To read this field correctly, two concepts must be separated. Diagnosis identifies which condition the patient has. Triage, by contrast, classifies how urgently the patient needs care — for example, "go to the emergency department now," "see a physician within 24 hours," or "manage at home." This distinction matters because the evidence consistently shows that digital tools perform better at triage than at diagnosis.

This pattern was clearly established in the field's landmark study, the 2022 systematic review by Wallace and colleagues: while the primary-diagnosis accuracy of digital symptom checkers sat at just 19-37.9%, triage accuracy spanned a much wider and generally higher band of 48.8-90.1%. Even so, roughly 69% of studies showed "suboptimal" triage. These figures set the benchmark for the years that followed.

Did Large Language Models Shift the Balance?

With the mainstreaming of ChatGPT in late 2022 came a rapid move from rule-based symptom checking toward large language models (LLMs). The impact of this shift on triage performance is examined in the strongest evidence layer of 2024-2026 — meta-analyses — and the results do not point in a single direction.

One of the most comprehensive studies is the 2026 meta-analysis by Chen and colleagues in npj Digital Medicine, covering 50 studies and 25 different LLMs. Its striking finding: there is no significant difference between LLMs and healthcare professionals in triage accuracy (relative accuracy 1.01; 95% CI 0.94-1.09). In other words, at least in study conditions, LLMs can make triage decisions on par with clinicians. Yet perhaps the more important finding concerns collaboration: an LLM-assisted professional outperforms the professional alone on top-1 diagnosis (relative accuracy 1.13; 95% CI 1.00-1.27), although the lower bound of that interval sits right at 1.00, the no-difference line. This is the field's most robust positive signal, indicating that the value of AI lies not in "replacement" but in "decision support."
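To make the reading of these intervals concrete, here is a minimal Python sketch that classifies a ratio estimate by where its 95% CI sits relative to 1.0. The two estimates are the figures quoted above; the class name, the function name, and the 0.005 "borderline" tolerance (half of the last reported decimal) are illustrative assumptions, not anything taken from Chen and colleagues.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatioEstimate:
    label: str
    point: float
    ci_low: float
    ci_high: float


def interpret(est: RatioEstimate, null: float = 1.0, tol: float = 0.005) -> str:
    """Classify a ratio by where its CI sits relative to the null value.

    `tol` is an assumed tolerance: a bound reported as 1.00 could be
    anything from 0.995 to 1.005 before rounding, so it is flagged
    as borderline rather than forced to one side.
    """
    if abs(est.ci_low - null) < tol or abs(est.ci_high - null) < tol:
        return "borderline (a CI bound touches the null at reported precision)"
    if est.ci_low > null:
        return "favours the first group"
    if est.ci_high < null:
        return "favours the comparator"
    return "no significant difference"


estimates = [
    RatioEstimate("LLM vs professional, triage", 1.01, 0.94, 1.09),
    RatioEstimate("LLM-assisted vs unassisted professional, top-1 diagnosis",
                  1.13, 1.00, 1.27),
]

for est in estimates:
    print(f"{est.label}: {est.point:.2f} "
          f"({est.ci_low:.2f}-{est.ci_high:.2f}) -> {interpret(est)}")
```

Read this way, the triage comparison is a clear "no difference," while the collaboration result is positive but sits at the edge of significance at the precision reported.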

There is also evidence that tempers this optimistic picture. The 2026 meta-analysis by Gao and colleagues in BMC Emergency Medicine, covering 15 studies, found a pooled triage accuracy of 0.70 for GPT-4 and its derivatives (95% CI 0.58-0.81), reaching 0.81 in optimized versions — a marked leap over GPT-3.5's weak 0.51. The authors, however, issue a clear caveat: GPT-4's superiority over humans is statistically fragile; significance in sensitivity analysis hinges on specific studies, and heterogeneity (I²) is very high. Similarly, the 2024 meta-analysis by Kaboudi and colleagues reported a high accuracy of 0.86 for GPT-4.0 but flagged a publication-bias signal in funnel-plot analysis — suggesting that favourable results may be more likely to be published while unfavourable ones stay in the drawer.
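For readers less familiar with I², the sketch below shows how the statistic is conventionally derived from Cochran's Q under inverse-variance weighting. The study accuracies and variances are made-up toy values chosen only to show how widely scattered results push I² above 90%; they are not the data from Gao or Kaboudi, and the function names are illustrative.

```python
def cochran_q(effects: list[float], variances: list[float]) -> float:
    """Cochran's Q: weighted squared deviations from the fixed-effect pooled mean."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))


def i_squared(q: float, k: int) -> float:
    """Higgins' I² in percent: share of variation beyond what chance explains."""
    df = k - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical toy studies: three accuracies with equal, small variances.
toy_effects = [0.55, 0.72, 0.88]
toy_variances = [0.001, 0.001, 0.001]

q = cochran_q(toy_effects, toy_variances)
print(f"Q = {q:.1f}, I² = {i_squared(q, len(toy_effects)):.1f}%")
```

When I² is this high, the pooled accuracy is an average over studies that disagree with each other, which is why a single headline number deserves caution.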

FRAGILE SUPERIORITY

Meta-analyses find LLM triage equivalent to clinicians (relative accuracy 1.01), but any apparent superiority, such as GPT-4's pooled advantage, disappears under sensitivity analysis; heterogeneity is very high (I² above 90%), and a publication-bias signal is present. "Equivalent" does not mean "reliable in every context."

Numbers Side by Side: Comparative Accuracy

Placing the triage/diagnosis performance of different tools and models side by side is essential for an unembellished appraisal. The table below presents representative figures drawn from the major studies of 2024-2026 — with each row's context (vignette versus real patient) stated in its column.

Tool / Model	Task	Accuracy	Context (Source, Year)
LLM (pooled)	Triage	≈ clinician (rel. acc. 1.01)	Vignette/MA (Chen, npj 2026)
GPT-4 derivatives	Triage	0.70 (opt. 0.81)	Vignette/MA (Gao, BMC 2026)
GPT-4.0	Triage	0.86 (I²=93%)	Vignette/MA (Kaboudi 2024)
ChatGPT-4o	Paediatric triage	76.1%	Real patients (Frontiers 2026)
Nurse	Paediatric triage	53.1%	Real patients (Frontiers 2026)
Platform24 (rule-based)	Triage safety	94%	Real mistriage cases (2025)
NHS 111 online	Diagnosis	80% (16/20)	Vignette (Cureus 2025)
ChatGPT (general)	Diagnosis (vignette)	70%	Vignette (Cureus 2025)
Symptom apps	Self-triage	25.9-88.0%	SR (npj 2025)
LLMs	Self-care cases	10.8%	SR (npj 2025)

The Real-Patient Test: Beyond the Vignette

Most of the figures above were obtained from "vignettes" — pre-prepared, clean, standardized case scenarios. A real patient, however, describes symptoms in scattered, ambiguous, everyday language. This is why real-world studies are far more valuable: the true test begins here.

A single-centre, prospective paediatric emergency study from Türkiye (Frontiers in Pediatrics, 2026; 1,505 real child patients) is illuminating on this point. ChatGPT-4o's triage accuracy was 76.1% (95% CI 73.9-78.2), versus 53.1% for nurses and 47.0% for Grok 3. Agreement (Cohen's κ) was 0.69 (good) for ChatGPT-4o, 0.42 for nurses, and 0.31 for Grok 3. Yet the same study adds a critical nuance: for children with chronic illness, nurses were markedly superior (59.5% vs ChatGPT 28.3%). Moreover, while Grok 3 caught the most critical cases (ESI-2) with 97.7% sensitivity, its specificity was low and it over-triaged in 36.3% of cases. The authors' conclusion is clear: AI should support, not replace, the nurse.
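The reported interval can be sanity-checked with a short calculation. The Python sketch below computes a Wilson score interval for a proportion; the count of correct decisions is back-calculated from 76.1% of 1,505 (an assumption, since the exact count is not quoted here), and whether the authors used the Wilson method is also an assumption.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


n_patients = 1505
# Assumption: count back-calculated from the reported 76.1% accuracy.
correct = round(0.761 * n_patients)

low, high = wilson_interval(correct, n_patients)
print(f"{correct}/{n_patients} correct, 95% CI {low:.1%} to {high:.1%}")
```

Under these assumptions the interval lands close to the reported 73.9-78.2. The narrow width reflects the large sample; the much smaller chronic-illness subgroup would carry considerably wider uncertainty.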

Another significant real-world study is the prospective post-market evaluation of Ada Health's "digital front door" within the CUF hospital network in Portugal (npj Digital Medicine, 2026). Here, treating physicians generally found the system's urgency advice and reports appropriate; reading the report before consultation increased physician preparedness and perceived efficiency. Note, however, that this study measured appropriateness and perception rather than quantitative diagnostic accuracy. That is a different type of evidence, but its real clinical setting adds value.

Perhaps the most cleverly designed validation comes from Platform24's Swedish study (Scand J Prim Health Care, 2025): the tool was tested on 390 vignettes derived from cases that had actually been mistriaged in practice, and it achieved 91% accuracy (95% CI 88-94) and 94% safety (95% CI 91-96). In other words, the system stayed safe even on the hard cases where humans had erred.

Failures and Systematic Risks

An honest appraisal must name failures as clearly as successes. The evidence points to several systematic weaknesses.

Over-triage and self-care blindness: A comprehensive 2025 systematic review (npj Digital Medicine, 19 studies) found that LLMs achieved only 10.8% accuracy in self-care cases (those that could in fact be managed at home). This means the models fail to recognize low urgency and systematically push patients toward the emergency department more than necessary — a serious problem causing resource waste and undue anxiety. In the same review, app accuracy in emergency cases was 74.5% and LLM accuracy 66.7%.
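Why a single overall accuracy figure can hide this problem is easier to see with a small calculation. The Python sketch below scores (gold, predicted) urgency pairs on an ordinal scale and reports accuracy per gold level alongside over- and under-triage rates. The three-level scale mirrors the categories used earlier in this article; the ten cases are invented toy data, not results from the review.

```python
from collections import Counter
from enum import IntEnum


class Urgency(IntEnum):
    SELF_CARE = 0    # manage at home
    NON_URGENT = 1   # e.g. see a physician within 24 hours
    EMERGENCY = 2    # go to the emergency department now


def triage_summary(pairs: list[tuple[Urgency, Urgency]]) -> dict:
    """Summarise (gold, predicted) pairs; over-triage means predicted > gold."""
    if not pairs:
        raise ValueError("no cases to score")
    n = len(pairs)
    over = sum(1 for gold, pred in pairs if pred > gold)
    under = sum(1 for gold, pred in pairs if pred < gold)
    totals = Counter(gold for gold, _ in pairs)
    hits = Counter(gold for gold, pred in pairs if gold == pred)
    return {
        "accuracy": (n - over - under) / n,
        "over_triage": over / n,
        "under_triage": under / n,
        "accuracy_by_level": {
            lvl.name: hits[lvl] / totals[lvl] for lvl in Urgency if totals[lvl]
        },
    }


S, N, E = Urgency.SELF_CARE, Urgency.NON_URGENT, Urgency.EMERGENCY
# Hypothetical toy cases: (gold label, tool's recommendation).
toy_cases = [
    (S, N), (S, E), (S, N), (S, S),   # self-care mostly pushed upward
    (N, N), (N, N), (N, E),
    (E, E), (E, E), (E, N),
]

summary = triage_summary(toy_cases)
print(summary)
```

In this toy set overall accuracy is 50%, yet self-care accuracy is only 25% and most errors are over-triage, which is the same shape of failure the review describes.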

Distribution shift: A 2026 methodological preprint (arXiv) argues that the high scores on curated vignettes collapse under real, patient-written, colloquial, and ambiguous input. According to the authors, evaluation format drives triage failure more than model capability does, and current safety benchmarks may overstate real-world reliability. As this is a preprint (not peer-reviewed), its evidential weight is low, but the warning is worth noting.

Automation bias and sycophancy: Even physicians trained in AI literacy have been shown to adopt erroneous LLM recommendations. LLMs also exhibit a tendency toward "sycophancy" — conforming to a false belief expressed by the patient or clinician and overriding correct information. This is an insidious risk within the clinical decision chain.

Subgroup and modality weaknesses: In situations requiring visual input (such as dermatological findings), text-based LLMs remain markedly weak; in one study, dermatology referral accuracy fell to 79.57%, and 43.42% of errors stemmed from "over-reliance on history rather than current symptoms." Performance drops in subgroups such as young children and chronically ill patients, and evidence of language/ethnic bias exists.

The Regulatory Landscape: Which Tool Is a Medical Device?

From a clinical standpoint, a critical distinction is whether a tool is approved as a medical device. Here the picture is mixed. In Europe, Ada Health holds Class IIa medical-device certification under the EU-MDR (TÜV SÜD), while Infermedica's triage/intake/follow-up products are certified as Class IIb and recognized in the United Kingdom. In the United States, by contrast, the FDA has not yet approved any general-purpose LLM as a symptom-checking or clinical triage device; its 2025 approach is built on the "Total Product Life Cycle" (TPLC) and the "Predetermined Change Control Plan" (PCCP). Aidoc's CARE1 system received the first foundation-model-based FDA clearance in February 2025, but this is an image-triage tool, not a symptom checker; the distinction must be kept clear.

The practical upshot is this: the general-purpose chatbots patients use on their own (such as ChatGPT and Gemini) are not approved as medical devices, and when used for "triage," they fall outside regulatory scope. Comparative evidence points the same way on performance: in one vignette comparison (Cureus 2025), the purpose-built NHS 111 online outperformed ChatGPT, identifying 14/15 emergencies versus 12/15.

The Next Generation: GPT-5 and Agent-Based Systems

The GPT-5 family entered the scene in 2025-2026, and preliminary work (medRxiv 2025) reports MedQA accuracy of 86.3% on adult questions and 88.5% in paediatrics, with a more structured diagnostic pathway and stronger safety awareness than GPT-4. Agent-based systems (architectures that reason over multiple steps) can surpass even this baseline (diagnostic accuracy 89.3% vs 84.6%). Yet most of these figures again come from exam-style and vignette settings; real-patient validation remains the missing link.

Conclusion

AI-driven triage and symptom assessment has made rapid, genuine progress over the past two years, but that progress must be read without embellishment. What is proven: LLM-based triage can reach accuracy equivalent to clinicians in vignette settings (relative accuracy 1.01); the clinician-plus-LLM combination outperforms the clinician alone on top-1 diagnosis (1.13); and certified rule-based tools can demonstrate high safety on real mistriage cases (94%). What is unproven or uncertain: real-world patient benefit (reducing healthcare use, improving outcomes) is weakly evidenced; LLM superiority is fragile (it disappears under sensitivity analysis, heterogeneity exceeds I² of 90%, and publication bias is signalled); diagnosis still lags humans; and over-triage in self-care cases is severe (only 10.8% accuracy).

The most consistent lesson of this picture is this: AI is strongest not as a replacement for the clinician, but as a support to one. When general-purpose chatbots are used by patients for "triage," they remain unregulated and risky; meanwhile, purpose-built, certified, real-world-validated tools are steadily maturing. The evidence conflicts: "LLMs are equivalent" (Chen), "the superiority is fragile" (Gao), "purpose-built tools are superior" (NHS 111). It does not collapse into a single headline, and the right stance is to read this plurality in context. For the clinician in practice, the message is plain: treat these tools as a decision-support layer, never give them the final word, and preserve human judgment especially in low-urgency decisions.

References

Chen et al. Independent and collaborative performance of LLMs and healthcare professionals in diagnosis and triage. npj Digital Medicine. 2026.
Gao et al. Accuracy of ChatGPT in adult emergency department triage: a systematic review and meta-analysis. BMC Emergency Medicine. 2026.
Kaboudi et al. Diagnostic Accuracy of ChatGPT for Patients' Triage: Systematic Review and Meta-Analysis. Archives of Academic Emergency Medicine. 2024.
Shan et al. Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models. JMIR Medical Informatics. 2025.
Wang et al. Accuracy of LLMs When Answering Clinical Research Questions: A Network Meta-Analysis. Journal of Medical Internet Research. 2025.
Pimenta et al. Appropriateness and utility of a clinical decision support system at the digital front door. npj Digital Medicine. 2026.
Ilicki et al. Evaluating a digital triage symptom checker using historical triage-related adverse events. Scandinavian Journal of Primary Health Care. 2025.
Patient Triage and Guidance in Emergency Departments Using LLMs: A Multimetric Study. Journal of Medical Internet Research. 2025.
Should we leave paediatric emergency triage to AI? ChatGPT-4o versus Grok 3. Frontiers in Pediatrics. 2026.
Accuracy of online symptom assessment apps, LLMs, and laypeople for self-triage: a systematic review. npj Digital Medicine. 2025.
Wallace et al. The diagnostic and triage accuracy of digital and online symptom checker tools: a systematic review. npj Digital Medicine. 2022.
Ada Health. Medical quality and EU-MDR Class IIa certification. Regulatory document. 2025.

Disclaimer: This content is for educational and informational purposes only and does not substitute for diagnosis or treatment decisions. Symptom-checker and triage apps do not replace a physician evaluation. For acute symptoms (e.g., severe chest pain, shortness of breath, altered consciousness), seek emergency care without delay.