Artificial Intelligence in Digital and Computational Pathology: Accelerating Approvals, Foundation Models, and an Honest Reckoning
Between 2024 and 2026, AI in digital pathology accelerated across regulatory approval, model scale, and clinical integration — yet behind the high-accuracy claims lie methodological weakness and a striking absence of prospective outcome evidence.
Pathology is the diagnostic backbone of modern medicine; the majority of cancer diagnoses still rest on a pathologist examining tissue sections under a microscope. Over the past decade, the digitization of this process — converting tissue sections into high-resolution whole-slide images (WSI) — has opened the door for artificial intelligence (AI) to analyze those images directly. By 2025-2026, the field has moved well beyond its earlier "experimental" framing into tangible maturity in terms of regulatory clearance, model scale, and clinical integration. Yet this rapid progress demands an honest reckoning: for the clinical decision-maker, the line between what has genuinely been proven and what remains a promise is critical. This article compiles the most current evidence without overstatement.
The 2024-2026 wave of regulatory clearances
The most striking update is the pace of approval. Of the roughly 48 de novo and 510(k) authorizations that the FDA's pathology panel has granted to whole-slide imaging systems and software, 17 were issued in the seven years from 2017 through 2023, while 32 came in 2024-2025 alone. Approvals have, in other words, clearly accelerated. The framing that held until recently, that "there are only one or two FDA-cleared AI tools in pathology, with Paige as the lone example," is now obsolete.
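As a rough check on that acceleration, the counts quoted above can be turned into per-year rates. This is only a back-of-the-envelope sketch: it assumes the two periods are exactly seven and two calendar years long and treats every authorization as equivalent, and the variable names are illustrative.

```python
# Counts and period lengths as quoted in the text (assumed to be full calendar years).
early_count, early_years = 17, 7    # 2017 through 2023
recent_count, recent_years = 32, 2  # 2024 through 2025

early_rate = early_count / early_years      # authorizations per year
recent_rate = recent_count / recent_years   # authorizations per year
speedup = recent_rate / early_rate          # how many times faster the recent pace is

print(f"2017-2023: {early_rate:.1f} per year")
print(f"2024-2025: {recent_rate:.1f} per year")
print(f"ratio: {speedup:.1f}x")
```

On these figures the annual pace rises from about 2.4 to 16 authorizations, a roughly 6.6-fold increase.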
Three clearances illustrate different facets of this proliferation. Paige Prostate (2021, de novo) was the first such tool, for tumor detection. It was joined by Ibex Prostate Detect (FDA 510(k), 10 February 2025), an IVD software that helps detect small and rare cancers on prostate biopsy. In its validation study, Ibex reported a 99.6% positive predictive value (PPV) for the cancer heatmap and detected missed cancer in 13% of a consecutive cohort of patients initially diagnosed as benign.
The genuine paradigm shift, however, arrived with ArteraAI Prostate (FDA De Novo, 13 August 2025). This is the first AI-based prognostic/predictive test in digital pathology and opened an entirely new product-code category. Using a multimodal approach (digitized biopsy image plus clinical data), it identifies which patients with localized prostate cancer will benefit from short-term hormone therapy. Crucially, its development and validation rest on multiple phase-3 randomized controlled trials (RCTs) with follow-up of up to 15 years — and it outperformed standard models in predicting distant metastasis, biochemical failure, and prostate-cancer-specific mortality. This represents a move from "did it see the tumor?" to "which treatment helps this patient?"
On the infrastructure side, clearances also expanded: Roche's VENTANA DP 200 scanner (510(k), June 2024), DICOM diagnostic use for Sectra with the Leica Aperio GT 450 DX (2024), and multi-scanner support for PathAI AISight Dx (2024-2025). The first WSI primary-diagnosis authorization, meanwhile, remains the 2017 Philips IntelliSite.
Clearance ≠ Proof of Benefit
FDA 510(k)/de novo authorizations mostly rest on retrospective reader and validation studies; they establish that a product is safe and effective, but not that it improves patient outcomes (survival, mortality). ArteraAI's phase-3 RCT foundation is a strong exception to this rule.
Foundation models: the scale race and the "no single best" reality
The biggest technical leap in computational pathology over the past two years is foundation models — large neural networks that learn self-supervised from millions of unlabeled slides and can then be adapted to many tasks. Virchow (Paige, Nature Medicine, October 2024) was the largest at the time: a 632-million-parameter vision transformer (ViT) trained on 1.5 million WSI from roughly 100,000 patients. It reported a specimen-level AUC of 0.95 across 9 common and 7 rare cancers in pan-cancer detection, and surpassed in-production tissue-specific clinical models for some rare variants.
A scale race followed: Virchow2 (3.1 million WSI from 225,401 patients), Microsoft/Providence's Prov-GigaPath (1.3 billion patches across 31 tissue types; +23% over standard for EGFR-mutation prediction), and Bioptimus's 1.1-billion-parameter H-Optimus-0. But the assumption that "bigger model = better" required the field's most important correction.
That correction came from the independent benchmark by Neidlinger and colleagues (Nature Biomedical Engineering, 2025): 19 foundation models, 13 cohorts, 6,818 patients / 9,528 slides (lung, colorectal, gastric, breast). The result is striking: the vision-language model CONCH delivered the highest overall performance, with Virchow2 a close second — but the advantage of all models weakened on low-data and low-prevalence tasks. Moreover, the best model varied by cancer type: CONCH for gastric and non-small-cell lung cancer, Virchow2 for colorectal, BiomedCLIP for breast. In short, there is no single "best" foundation model; performance is task- and tissue-dependent. This establishes the need to choose models on use-case-specific evidence rather than commercial claims.
Aggregate diagnostic accuracy: the strongest number, with the most critical caveat
The most comprehensive numerical basis for AI's general accuracy in pathology is the systematic review and diagnostic-test-accuracy meta-analysis by McGenity and colleagues (NPJ Digital Medicine, 2024). Of 2,976 studies identified, 100 were included in the review and 48 in the meta-analysis, together spanning more than 152,000 WSI across many diseases. The pooled results are impressive: sensitivity 96.3% (CI 94.1-97.7), specificity 93.3% (CI 90.5-95.4). For gastrointestinal pathology, sensitivity is 93% and specificity 94%; for uropathology, 95% and 96%.
But the most important message of this paper is not the accuracy figure but the caveat behind it: 99% of the studies carried high or unclear risk of bias or applicability concerns in at least one domain on QUADAS-2. High accuracy is reported, but the methodology producing it is largely weak. In the authors' own emphasis, the field requires "much more rigorous evaluation." The meta-analysis is openly available as full text in PMC and carries no retraction or erratum, which is why this article foregrounds its figures.
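Pooled sensitivity and specificity also say little on their own about how trustworthy a positive call is in a given laboratory, because that depends on disease prevalence. The sketch below applies Bayes' rule to the pooled estimates quoted above. The prevalence values are hypothetical and chosen only for illustration, and the function name is not taken from any library.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test via Bayes' rule."""
    tp = sensitivity * prevalence              # true positives per unit population
    fn = (1 - sensitivity) * prevalence        # false negatives
    tn = specificity * (1 - prevalence)        # true negatives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    return tp / (tp + fp), tn / (tn + fn)

SENS, SPEC = 0.963, 0.933  # pooled estimates quoted in the text

# Hypothetical prevalences, for illustration only.
for prevalence in (0.01, 0.10, 0.30):
    ppv, npv = predictive_values(SENS, SPEC, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

Under these assumptions the positive predictive value is only about 13% at 1% prevalence and climbs above 60% at 10% prevalence, which is one reason a headline accuracy figure cannot be transferred between settings without local validation.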
Task-specific meta-analyses paint a more mature picture. For HER2 immunohistochemistry classification in breast tissue, AI has reached a pooled sensitivity of 0.97 (CI 0.96-0.98), specificity 0.82 (0.73-0.88), and an AUC of 0.98. On narrow, well-defined tasks, AI has come closest to true maturity in pathology.
| Evidence source | Scope | Key result | Critical caveat |
|---|---|---|---|
| McGenity meta-analysis (2024) | 100 studies reviewed, 48 meta-analyzed, >152,000 WSI | Sensitivity 96.3% / Specificity 93.3% | 99% of studies at high/unclear bias (QUADAS-2) |
| Neidlinger benchmark (2025) | 19 models, 6,818 patients | CONCH overall leader, Virchow2 second | Advantage weakens on low-data and low-prevalence tasks; best model varies by tissue |
| Virchow (2024) | 1.5M WSI, 16 cancer types | Pan-cancer AUC 0.95 | Retrospective; limited external validation |
| HER2 IHC meta-analysis (2024) | Task-specific | AUC 0.98; sensitivity 0.97 | Narrow task; limited generalizability |
Integration into clinical workflow: feasible but slow
Fully digital "sign-out" — the pathologist rendering the entire diagnosis from the screen without returning to glass slides — is technically feasible. One center began without even a single clinical scanner and, as of September 2025, signs out all of its clinical cases digitally. Yet adoption is slow: as of late 2024, only about 10% of laboratories had transitioned to a fully digital workflow, while roughly one-third of clinical labs in the United States had initiated or planned the process.
The benefits of the transition are concrete: some hospitals report an approximately one-third reduction in turnaround time, a decline in immunohistochemistry use within AI-assisted workflows, and increased diagnostic confidence. The largest barrier, by contrast, is cost: a complete solution comprising a high-capacity scanner, storage servers, and high-resolution displays is a six-figure investment for most laboratories and is difficult to justify for budget-constrained units.
Generative and agentic AI: the step on the horizon
The emerging headline of 2026 is generative and agentic models. Systems such as PathChat, CONCH, PRISM, MUSK, and TITAN aim to transform the pathologist's workflow with capabilities such as automated report generation, visual question answering, and slide navigation. A caveat is essential, however: as of 2026, these technologies remain at the research stage and are not approved for clinical use. The promise is large, but clinical validation is only beginning.
Limits and an honest frame: what has been proven, and what has not
What has been proven: Task-specific detection (prostate cancer detection, HER2/IHC classification) significantly improves pathologist sensitivity in retrospective reader studies — with Paige Prostate, sensitivity rose from 88.7% to 96.6%, with an approximately 70% reduction in false negatives. Foundation models learn powerful representations even under data-constrained conditions. Fully digital sign-out is technically feasible.
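The "approximately 70%" figure follows directly from the two sensitivities quoted for Paige Prostate, since the false-negative rate is one minus sensitivity. A minimal check, using only the numbers in the text and illustrative variable names:

```python
sens_unaided, sens_assisted = 0.887, 0.966  # sensitivities quoted in the text

fn_unaided = 1 - sens_unaided    # about 11.3% of cancers missed without AI
fn_assisted = 1 - sens_assisted  # about 3.4% missed with AI assistance

# Relative reduction in the false-negative rate.
relative_reduction = (fn_unaided - fn_assisted) / fn_unaided
print(f"relative reduction in false negatives: {relative_reduction:.1%}")
```

The result is about 69.9%, consistent with the rounded figure above.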
What has not been proven, or remains weak:
Prospective RCT evidence remains very sparse. Most approvals rest on retrospective studies. The first large multicenter prospective RCT ("AI-Assisted Pathologist Performance Improvement," NCT07291362) only began in November 2025 and will conclude around 2027. Randomized patient-outcome evidence at the level of mortality or survival therefore does not yet exist.
The external-validation gap: Training cohorts (particularly TCGA) come predominantly from populations of European ancestry and a limited number of academic centers; sensitivity may decline in under-represented groups.
Distribution shift (domain shift): Differences in scanner optics, staining protocol, and compression create greater variability between centers than within a center. Staining variation is the most important technical barrier to generalization; color normalization can introduce artifacts and demands frequent retraining.
Shortcut learning (batch effect): A model may learn an institution-specific technical signature rather than a clinical feature, inflating performance in the lab and degrading it in the real world.
Automation bias: Clinicians, especially when inexperienced or under time pressure, may place the AI's suggestion ahead of their own judgment.
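To make the color-normalization point in the distribution-shift item concrete, the sketch below matches per-channel mean and standard deviation between two sets of pixels. It is a deliberately simplified, Reinhard-style statistic matching performed directly in RGB, not the method of any tool named in this article. The pixel values and the function name are hypothetical. The clipping step at the end is one place where artifacts can enter.

```python
from statistics import mean, pstdev

def match_channel_stats(source, target):
    """Shift and scale each RGB channel of `source` to the mean/std of `target`."""
    channels = []
    for c in range(3):
        src = [px[c] for px in source]
        tgt = [px[c] for px in target]
        s_mu, s_sd = mean(src), pstdev(src)
        t_mu, t_sd = mean(tgt), pstdev(tgt)
        scale = t_sd / s_sd if s_sd > 0 else 1.0
        # Clip to the valid 8-bit range; heavy clipping distorts the image.
        channels.append([min(255.0, max(0.0, (v - s_mu) * scale + t_mu)) for v in src])
    return [tuple(ch[i] for ch in channels) for i in range(len(source))]

# Hypothetical pixels from two sites with different staining and scanners.
site_b_pixels = [(200, 100, 150), (180, 90, 140), (220, 120, 170), (190, 95, 160)]
site_a_pixels = [(150, 60, 120), (170, 80, 140), (160, 70, 130), (140, 50, 110)]

normalized = match_channel_stats(site_b_pixels, site_a_pixels)
print(normalized)
```

Matching global statistics in this way treats all tissue alike, so slides whose tissue composition differs from the reference can be shifted in misleading ways. That is consistent with the caution above that normalization is not a complete remedy.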
The shift "from accuracy to reliability"
According to the 2026 expert consensus, the focus must shift from raw accuracy metrics toward reliability. Human-level performance in an academic, controlled setting does not automatically translate into real clinical benefit. Active localization — on-site validation, post-hoc calibration, and site-specific fine-tuning — is not a luxury but a necessity.
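One common form of post-hoc calibration is temperature scaling, in which a single scalar is fitted on local validation data so that the model's confidence matches its observed accuracy. The sketch below is a minimal, standard-library illustration of that idea, not the procedure recommended by the consensus document. The logits and labels are hypothetical, and a grid search stands in for the gradient-based fit normally used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(logits, labels, temperature):
    """Mean negative log-likelihood after dividing logits by the temperature."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Pick the grid temperature that minimises NLL on local validation data."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]  # 0.5 to 5.0
    return min(grid, key=lambda t: nll(logits, labels, t))

# Hypothetical on-site validation set: confident logits, two of them wrong.
val_logits = [6.0, 5.5, 4.8, -5.2, -6.1, 5.0, -4.9, 4.7, -5.5, 5.3]
val_labels = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

T = fit_temperature(val_logits, val_labels)
print(f"fitted temperature: {T:.2f}")
print(f"NLL before: {nll(val_logits, val_labels, 1.0):.3f}")
print(f"NLL after:  {nll(val_logits, val_labels, T):.3f}")
```

A fitted temperature above 1 indicates that the model was overconfident on the local data. Scaling by a single temperature leaves the ranking of cases unchanged, so it can improve reliability without changing AUC, but it needs to be refitted whenever the site's scanner, staining, or case mix changes.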
Conclusion
Between 2024 and 2026, artificial intelligence in digital and computational pathology moved from an experimental curiosity to a tangible reality across regulatory, technical, and clinical dimensions. Approvals proliferated and accelerated; with ArteraAI, the field advanced from tumor detection to treatment prediction; foundation models succeeded in learning powerful, general-purpose representations. In task-specific applications (prostate detection, HER2 classification), AI measurably supports pathologist performance.
Against this, an honest reading cannot conceal the gaps: although our strongest numerical basis (the McGenity meta-analysis) shows high accuracy, 99% of the underlying studies carry methodological bias; there is no such thing as a single best foundation model, with performance being task-dependent; and, most critically, large prospective randomized evidence showing improved patient outcomes has not yet matured — its first fruits are not expected until 2027. Distribution shift, the lack of external validation, and automation bias remain genuine obstacles still to be overcome in clinical deployment.
The practical conclusion is this: AI is most reliably positioned as a tool that augments rather than replaces human expertise in pathology. Every clinical deployment should rest not on commercial accuracy claims but on evidence validated, calibrated, and continuously monitored within its own laboratory. The field's maturation will come not from bigger models, but from far better evidence.
References
- Vorontsov E, et al. A foundation model for clinical-grade computational pathology and rare cancers detection (Virchow). Nature Medicine. 2024.
- McGenity C, et al. Artificial intelligence in digital pathology: a systematic review and meta-analysis of diagnostic test accuracy. NPJ Digital Medicine. 2024.
- Neidlinger P, et al. Benchmarking foundation models as feature extractors for weakly-supervised computational pathology. Nature Biomedical Engineering. 2025.
- Ibex Medical Analytics Receives First FDA 510(k) Clearance (Ibex Prostate Detect). Business Wire. 2025.
- Artera Receives U.S. FDA De Novo Marketing Authorization (ArteraAI Prostate). Business Wire. 2025.
- FDA grants de novo authorization to ArteraAI Prostate. Urology Times. 2025.
- Computational Pathology in the Era of Emerging Foundation and Agentic AI: International Expert Perspectives. arXiv. 2026.
- Digital Pathology Imaging AI in Cancer Research and Clinical Trials: An NCI Workshop Report. NCI / PMC. 2025.
- Application of AI and digital tools in cancer pathology. The Lancet Digital Health. 2025.
- Roche receives FDA clearance on its digital pathology solution (VENTANA DP 200). Roche Diagnostics. 2024.
- AI-Assisted Pathologist Performance Improvement (NCT07291362). ClinicalTrials.gov. 2025.