
AI Ethics and Algorithmic Bias in Healthcare: Evidence, Regulation, and Limits as of 2026

As AI spreads rapidly through clinical practice, algorithmic bias has emerged as a real and measurable problem; this review weighs the evidence from classic risk scores, devices, and generative models in a balanced way, alongside the regulatory frameworks now in force.

By Cem Akaltun, MD · ~12 min read · Ethics, Algorithmic Bias, Data Privacy

Artificial intelligence has entered everyday medicine not as a promise of the future but as a tool already in active use. As of early 2026, the number of AI-enabled medical devices authorized by the U.S. Food and Drug Administration (FDA) has surpassed 1,350, the vast majority of them in radiology. This rapid diffusion raises an equally pressing question: do these systems deliver equal benefit to everyone, or do they scale up and entrench existing health inequities? Algorithmic bias is no longer a theoretical worry; it is a measurable phenomenon documented repeatedly across classic risk scores, medical devices, and large language models alike. This article aims to separate, without exaggeration, what has been proven from what remains uncertain.

What Is Algorithmic Bias, and Where Does It Come From?

Algorithmic bias occurs when a model systematically produces different, and usually worse, outcomes for particular demographic groups (defined by race, ethnicity, sex, age, or socioeconomic status). Its source most often lies not in the model's mathematics but in the data and definitions it is given. The most instructive example is the study by Obermeyer and colleagues, published in Science in 2019. They examined a widely used population-health risk algorithm, of a kind applied to roughly 200 million people in the United States each year, that used healthcare spending rather than health status itself as a proxy label for disease burden. Because less money had historically been spent on Black patients, the algorithm systematically rated equally sick Black patients as "healthier" and referred them less often to additional care programs.

The real significance of this work is that it not only diagnosed the problem but showed it to be solvable. When the label was redefined to reflect disease burden directly rather than spending, racial bias was reduced by 84%. The finding was replicated, in collaboration with the health data firm Optum, in an independent cohort of 3.7 million people. In other words, the source of bias is usually not "malicious code" but an overlooked design decision about what the model is taught to predict.
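The proxy-label mechanism can be made concrete with a small sketch. Everything here is an assumption for illustration: the cohort is synthetic, the group names are placeholders, and the 35% spending gap is an arbitrary toy value, not a figure from the study. The sketch only shows how ranking by spending instead of disease burden changes who is selected for extra care.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    group: str        # hypothetical demographic group label
    conditions: int   # count of chronic conditions, standing in for disease burden
    spending: float   # past-year healthcare spending (toy values)

def top_k_by(patients, key, k):
    """Return the k patients ranked highest by the given scoring function."""
    return sorted(patients, key=key, reverse=True)[:k]

def share_of_group(selected, group):
    """Fraction of the selected patients who belong to `group`."""
    return sum(p.group == group for p in selected) / len(selected)

# Synthetic cohort: both groups have identical disease burden, but group "B"
# incurs 35% less spending at the same burden (an assumed gap, not study data).
patients = []
for conditions in range(1, 11):
    patients.append(Patient("A", conditions, 1000.0 * conditions))
    patients.append(Patient("B", conditions, 650.0 * conditions))

k = 6  # number of slots in a hypothetical care-management program
by_spending = top_k_by(patients, lambda p: p.spending, k)
by_burden = top_k_by(patients, lambda p: p.conditions, k)

share_spending = share_of_group(by_spending, "B")  # label = spending
share_burden = share_of_group(by_burden, "B")      # label = disease burden

print(f"Group B share when ranked by spending: {share_spending:.2f}")
print(f"Group B share when ranked by burden:   {share_burden:.2f}")
```

In this toy cohort, group B fills half of the slots when patients are ranked by burden but only one slot in six when ranked by spending, even though the two groups are equally sick by construction.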

Device-Level Bias: The Pulse Oximetry Case

Bias can be built not only into complex machine-learning models but into simple measurement devices. A study by Sjoding and colleagues, published in the New England Journal of Medicine in 2020, showed that pulse oximetry systematically errs in patients with darker skin. While the device read oxygen saturation at 92–96%, "occult hypoxemia"—a true arterial value below 88%—occurred about three times more often in Black than in white patients. In a multicenter cohort the rate was 17% versus 6.2%, and in the Michigan cohort 11.7% versus 3.6%.
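The definition of occult hypoxemia used above translates directly into a calculation. The sketch below assumes paired readings of pulse-oximeter saturation (SpO2) and arterial saturation (SaO2) taken close together in time; the function name, the group labels, and the sample readings are hypothetical and are not study data. The final two lines only divide the percentages quoted above.

```python
def occult_hypoxemia_rate(pairs, spo2_range=(92, 96), sao2_threshold=88):
    """Share of readings with SpO2 inside `spo2_range` whose paired arterial
    SaO2 falls below `sao2_threshold`.

    `pairs` is an iterable of (spo2, sao2) percentages.
    Returns None when no reading falls inside the SpO2 range.
    """
    low, high = spo2_range
    in_range = [(spo2, sao2) for spo2, sao2 in pairs if low <= spo2 <= high]
    if not in_range:
        return None
    occult = sum(1 for _, sao2 in in_range if sao2 < sao2_threshold)
    return occult / len(in_range)

# Hypothetical paired readings, for illustration only
readings_by_group = {
    "group_1": [(94, 86), (95, 93), (93, 91), (96, 95), (98, 97)],
    "group_2": [(94, 92), (95, 94), (93, 90), (96, 96), (90, 85)],
}
rates = {group: occult_hypoxemia_rate(pairs)
         for group, pairs in readings_by_group.items()}

# Relative rates implied by the percentages quoted in the text
ratio_multicenter = 17.0 / 6.2   # about 2.74
ratio_michigan = 11.7 / 3.6      # 3.25
```

Readings outside the 92 to 96% window are excluded from the denominator, which is why the rate is a conditional one: it describes how often a reassuring reading conceals a low arterial value.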

This served as both a warning and a trigger for regulatory action. In its January 2025 draft guidance on pulse oximeters, the FDA recommended that manufacturers collect clinical data across a range of skin tones and that devices demonstrating equal performance be listed on a publicly available page. That a racial discrepancy in a measurement tool produced a concrete regulatory response is a sign that the field is maturing.

Under-Representation Is Documented

In a scoping review of 692 FDA-cleared AI-enabled medical devices, only 3.6% reported race/ethnicity data and fewer than 1% reported socioeconomic data. A device that does not report demographic breakdowns cannot prove it works fairly across groups.

The Generative AI Era: The Evidence Cuts Both Ways

Since 2024, large language models (LLMs) have transformed the field, and bias in these models has become its most contested topic. Here honesty is essential: the evidence is not one-directional. Results differ markedly by task type, which is precisely why conflicting findings must be presented side by side.

The evidence for bias is strong. A 2025 systematic review by Hanna and colleagues in the International Journal for Equity in Health examined 24 studies from 2018–2024 and found bias in 22 of them (91.7%); among the studies that assessed each type, gender bias appeared in 93.7% and racial/ethnic bias in 90.9%. In an evaluation published by Zack and colleagues in The Lancet Digital Health in 2024, GPT-4 failed to model the demographic distribution of diseases accurately, generated stereotyped clinical vignettes, and recommended advanced imaging less often for under-represented racial groups. A 2025 comparison in npj Digital Medicine across four models (Claude, ChatGPT, Gemini, and a local LLaMA variant) found that when a patient's race was stated, the models frequently recommended lower-quality treatment in psychiatric scenarios.

Yet over the same period, well-conducted studies finding no bias were also published. In work by Young and colleagues in the journal Pain in 2024, race, ethnicity, and sex did not influence the opioid recommendations of GPT-4 or Gemini across 480 constructed cases spanning every race–sex combination. In an emergency-department pain-management study by Fischetti and colleagues in the Journal of Emergency Medicine in 2026, race, language, and socioeconomic status mostly did not change recommendations. The most striking counterexample came from oncology: in an analysis by Roach and colleagues in JCO Clinical Cancer Informatics in 2025, covering 5,708 prostate-cancer patients from five randomized Phase III trials, a multimodal AI algorithm produced a similarly strong prognostic signal in both men of African ancestry (n=948) and non-African subgroups, with no evidence of algorithmic bias.

| Study / Source | Task type | Bias finding |
|---|---|---|
| Hanna et al. 2025 (systematic review) | Mixed (24 studies) | Bias found in 91.7% |
| Zack et al. 2024 (GPT-4) | Clinical vignette / recommendation | Bias present |
| npj Digital Medicine 2025 (psychiatry) | Treatment recommendation | Bias in treatment; minimal in diagnosis |
| Young et al. 2024 (opioids) | Structured recommendation | No bias |
| Fischetti et al. 2026 (ED pain) | Structured recommendation | Largely no bias |
| Roach et al. 2025 (prostate MMAI) | Prognostic algorithm | No bias |

The takeaway is clear: bias appears consistently in free-text and generative tasks (clinical vignettes, discharge instructions, report generation), whereas results are mixed in structured diagnostic and recommendation tasks. What determines the outcome is the design of the task and the engineering of the prompt as much as the brand of the model.
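Several of the studies above rest on the same design: hold the clinical case fixed, vary only the stated demographics, and check whether the recommendation changes. The following is a minimal sketch of that idea, not the protocol of any cited study. The category lists, the vignette template, and both stub functions are hypothetical placeholders; a real evaluation would call an actual model and compare outputs more carefully than by exact string match.

```python
from itertools import product

# Hypothetical category lists and template, for illustration only
RACES = ["Asian", "Black", "Hispanic", "White"]
SEXES = ["female", "male"]
TEMPLATE = "A 54-year-old {race} {sex} patient reports severe acute pain."

def build_variants(template, races, sexes):
    """One prompt per race-sex combination; everything else is held constant."""
    return {(race, sex): template.format(race=race, sex=sex)
            for race, sex in product(races, sexes)}

def recommendations_vary(variants, recommend):
    """True if `recommend` returns different outputs across the variants."""
    outputs = {recommend(prompt) for prompt in variants.values()}
    return len(outputs) > 1

def stub_invariant(prompt):
    # Stand-in for a model whose output ignores demographics
    return "plan_A"

def stub_sensitive(prompt):
    # Stand-in for a model whose output shifts with one stated attribute
    return "plan_B" if "Black" in prompt else "plan_A"

variants = build_variants(TEMPLATE, RACES, SEXES)
print(len(variants))                                   # number of variants
print(recommendations_vary(variants, stub_invariant))  # invariant stub
print(recommendations_vary(variants, stub_sensitive))  # sensitive stub
```

Free-text outputs rarely match exactly, which is one reason generative tasks are harder to audit than structured recommendations with a fixed set of answers.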

The Limits of Clinical Deployment: Generalization and Dataset Shift

Bias is not the only danger; a model that performs well in its developer's environment but fails in a real hospital is an equally critical risk. The best-known example is the Epic Sepsis Model. While the developer reported an AUC of 0.76–0.83, performance fell to roughly AUC 0.63 in independent external validation; in a 2023 evaluation across two county emergency departments, the model's sensitivity was only 14.7% and its positive predictive value 7.6%. Few examples illustrate better how misleading it can be to trust developer-reported metrics uncritically.
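Sensitivity and positive predictive value answer different questions than AUC, and both follow from simple counts. The sketch below uses hypothetical counts chosen only to land in a similar range; they are not the evaluation's data. The last line applies the PPV quoted above to show what it means for alert burden.

```python
def sensitivity(true_pos, false_neg):
    """Share of true cases the model flags: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)

def positive_predictive_value(true_pos, false_pos):
    """Share of alerts that are true cases: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)

def alerts_per_true_case(ppv):
    """Average number of alerts raised for each correctly flagged case."""
    return 1.0 / ppv

# Hypothetical counts for illustration only (not the study's data)
tp, fp, fn = 10, 90, 60
sens = sensitivity(tp, fn)                # 10 / 70, about 0.14
ppv = positive_predictive_value(tp, fp)   # 10 / 100 = 0.10

# Using the PPV quoted in the text (7.6%): about 13 alerts per true case
burden = alerts_per_true_case(0.076)
```

A model can post a respectable AUC and still miss most cases at its operating threshold while burying clinicians in false alerts, which is why threshold-level metrics belong in any external validation.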

To this must be added dataset shift. A model may not retain at deployment the performance it had at approval; as disease definitions, coding systems (for instance the transition from ICD-9 to ICD-10), clinical practice, and patient demographics change, a model can become obsolete. Static validation is therefore insufficient; continuous, proactive post-deployment monitoring is required.
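One simple way to watch for shift after deployment is to compare the distribution of model scores in a recent window against a baseline window. The sketch below uses the population stability index (PSI) as one possible drift statistic; this is an illustrative choice, not a method prescribed by any source cited here. It assumes scores lie in [0, 1], and the bin count, the smoothing constant, and the sample data are all assumptions.

```python
import math

def bin_fractions(scores, n_bins=10, eps=1e-6):
    """Fraction of scores per equal-width bin on [0, 1].
    Empty bins are floored at `eps` so the logarithm below stays defined."""
    counts = [0] * n_bins
    for score in scores:
        index = min(int(score * n_bins), n_bins - 1)
        counts[index] += 1
    total = len(scores)
    return [max(count / total, eps) for count in counts]

def population_stability_index(baseline, current, n_bins=10):
    """PSI = sum over bins of (current - baseline) * ln(current / baseline)."""
    base = bin_fractions(baseline, n_bins)
    curr = bin_fractions(current, n_bins)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, curr))

# Hypothetical score samples: the current window is shifted upward
baseline_scores = [i / 100 for i in range(100)]
current_scores = [min(i / 100 + 0.2, 0.99) for i in range(100)]

psi_same = population_stability_index(baseline_scores, baseline_scores)
psi_shifted = population_stability_index(baseline_scores, current_scores)
print(f"PSI, identical windows: {psi_same:.3f}")
print(f"PSI, shifted window:    {psi_shifted:.3f}")
```

A commonly cited rule of thumb treats PSI values above roughly 0.25 as a substantial shift, but any threshold should be set locally. Score drift is also only a signal that something changed: confirming a real loss of performance still requires outcome labels, ideally broken down by demographic group.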

The Regulatory Framework: Bias Is Now a Requirement

The period 2024–2025 turned bias from a well-meaning aspiration into a concrete regulatory obligation. In its lifecycle-management draft guidance of 7 January 2025, the FDA asks manufacturers to demonstrate that a device delivers similar benefit across all relevant demographic groups, and to provide data provenance, bias analysis, and a Predetermined Change Control Plan (PCCP) for updates. The World Health Organization (WHO), in its 18 January 2024 guidance on large multi-modal models (LMMs), issued more than 40 recommendations and emphasized mandatory independent post-release audits with outcomes disaggregated by user type (such as age, race, or disability).

The European Union's AI Act brings medical AI directly into scope: medical-device AI under the MDR/IVDR automatically falls into the "high-risk" class, with transparency rules taking effect on 2 August 2026, while full-compliance deadlines for embedded high-risk medical-device AI have been deferred to later years.

It is also becoming clear that the choice of fairness metric is a normative rather than a technical decision: group fairness and individual fairness may not be mathematically satisfiable at the same time, and the answer to "which fairness?" is a political and ethical choice.
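The tension between group and individual fairness can be seen in a toy example. The sketch below assumes hypothetical risk scores for two placeholder groups and uses selection rate as the group-level criterion; the scores and thresholds are invented for illustration. A shared threshold treats equal scores equally but selects the groups at different rates, while group-specific thresholds equalize the rates but treat equal scores differently.

```python
def selection_rate(scores, threshold):
    """Fraction of scores at or above the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)

# Hypothetical risk scores for two groups (toy values, not real data)
scores = {
    "A": [0.2, 0.4, 0.6, 0.8],
    "B": [0.1, 0.3, 0.5, 0.6],
}

# Policy 1: one shared threshold. Equal scores always receive equal
# decisions, but the two groups are selected at different rates.
shared_rates = {g: selection_rate(s, 0.6) for g, s in scores.items()}
shared_gap = abs(shared_rates["A"] - shared_rates["B"])

# Policy 2: group-specific thresholds chosen to equalize selection rates.
thresholds = {"A": 0.6, "B": 0.5}
group_rates = {g: selection_rate(s, thresholds[g]) for g, s in scores.items()}
group_gap = abs(group_rates["A"] - group_rates["B"])

def decide(group, score):
    """Decision under policy 2: compare against the group's own threshold."""
    return score >= thresholds[group]

# Two people with the same score of 0.55 now receive different decisions
same_score_differs = decide("A", 0.55) != decide("B", 0.55)
```

Neither policy is wrong in a mathematical sense. Which of the two inequalities is more tolerable in a given clinical setting is exactly the normative question the text describes.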

An Honest Balance Sheet: What We Know and What We Don't

What is certain is this: algorithmic bias is real, measurable, and reproducible; proxy-label selection is a principal structural source; and interventions such as relabeling can yield large improvements. Under-representation in data is documented, and regulators now demand concrete evidence of equitable performance.

What remains uncertain is no less important. Evidence on how LLM bias affects real patient outcomes (mortality, morbidity) is still largely confined to model-evaluation and vignette studies; randomized evidence based on patient outcomes is rare. No bias-mitigation strategy has yet been shown to be standardized, validated, and durably effective. One methodological caveat must also not be missed: publication bias. Because negative (no-bias) results are published less often, the literature's impression that "bias is everywhere" may be overstated. Noting this is not to minimize the problem but to stay faithful to the evidence.

Conclusion

The problem of bias in healthcare AI is neither a detail to be dismissed nor a flaw that warrants rejecting the technology outright. Read in a balanced way, the evidence says two things at once: bias genuinely arises from poorly designed labels and under-represented data; yet tools trained on representative data and carefully designed can also perform fairly across groups. The responsible path is neither blind adoption nor wholesale refusal. Demanding evidence of performance disaggregated by demographics, making external validation and post-deployment monitoring standard practice, placing conflicting evidence honestly side by side, and always leaving the final decision to the clinician's judgment—on this discipline depends whether AI becomes a force that narrows rather than widens inequity in healthcare. The tool is powerful; what makes it fair is the care of the person who wields it.

References

  1. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019.
  2. Sjoding MW, Dickson RP, Iwashyna TJ, et al. Racial Bias in Pulse Oximetry Measurement. New England Journal of Medicine. 2020.
  3. Zack T, Lehman E, Suzgun M, et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care. The Lancet Digital Health. 2024.
  4. Hanna M, et al. Evaluating and addressing demographic disparities in medical large language models: a systematic review. International Journal for Equity in Health. 2025.
  5. Bouguettaya A, Aboujaoude E, et al. Racial bias in AI-mediated psychiatric diagnosis and treatment. npj Digital Medicine. 2025.
  6. Young CC, Succi MD, et al. Racial, ethnic, and sex bias in large language model opioid recommendations for pain management. Pain. 2024.
  7. Roach M, et al. Assessing Algorithmic Fairness With a Multimodal AI Model in Men of African and Non-African Origin on NRG Oncology Prostate Cancer Phase III Trials. JCO Clinical Cancer Informatics. 2025.
  8. World Health Organization (WHO). Ethics and governance of AI for health: Large multi-modal models. WHO Guidance. 2024.
  9. U.S. FDA. Artificial Intelligence-Enabled Device Software Functions: Lifecycle Management and Marketing Submission Recommendations (Draft Guidance). FDA. 2025.
  10. npj Digital Medicine. Health Disparities and Reporting Gaps in AI-Enabled Medical Devices: A Scoping Review of 692 FDA Approvals. npj Digital Medicine. 2024.
  11. Anderson JW, Visweswaran S. Algorithmic individual fairness and healthcare: a scoping review. JAMIA Open. 2024.
  12. Djiberou Mahamadou AJ, Trotsyuk AA. Revisiting Technical Bias Mitigation Strategies. Annual Review of Biomedical Data Science. 2025.
Disclaimer: This content is for general informational and educational purposes only. The ethical and legal dimensions of healthcare AI vary by country, institution, and applicable law; for binding obligations, consult the relevant regulatory authority and legal counsel.