Evaluating the Presence of Sex Bias in Clinical Reasoning by Large Language Models

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates implicit gender bias in large language models (LLMs) during clinical reasoning, which may exacerbate healthcare disparities. Using 50 multi-specialty clinical cases—crafted by physicians and explicitly designed to be gender-agnostic—the authors systematically evaluate the inherent gender assignment tendencies of four leading LLMs under conditions devoid of explicit gender cues. Through a novel integration of controlled experiments, temperature modulation, abstention mechanisms, and confidence interval analysis, the work quantifies for the first time how model configurations influence bias manifestation. Results reveal that all evaluated models exhibit significant and consistent gender biases: ChatGPT, DeepSeek, and Claude predominantly assign female gender, whereas Gemini shows a male bias. Notably, these disparities persist even when models are permitted to abstain from gender assignment, indicating a robust and concerning influence of implicit bias on diagnostic reasoning.

📝 Abstract
Large language models (LLMs) are increasingly embedded in healthcare workflows for documentation, education, and clinical decision support. However, these systems are trained on large text corpora that encode existing biases, including sex disparities in diagnosis and treatment, raising concerns that such patterns may be reproduced or amplified. We systematically examined whether contemporary LLMs exhibit sex-specific biases in clinical reasoning and how model configuration influences these behaviours. We conducted three experiments using 50 clinician-authored vignettes spanning 44 specialties in which sex was non-informative to the initial diagnostic pathway. Four general-purpose LLMs were evaluated: ChatGPT (gpt-4o-mini), Claude 3.7 Sonnet, Gemini 2.0 Flash and DeepSeek-chat. All models demonstrated significant sex-assignment skew, with the predicted sex differing by model. At temperature 0.5, ChatGPT assigned female sex in 70% of cases (95% CI 0.66-0.75), DeepSeek in 61% (0.57-0.65) and Claude in 59% (0.55-0.63), whereas Gemini showed a male skew, assigning female sex in only 36% of cases (0.32-0.41). Contemporary LLMs exhibit stable, model-specific sex biases in clinical reasoning. Permitting abstention reduces explicit labelling but does not eliminate downstream diagnostic differences. Safe clinical integration requires conservative and documented configuration, specialty-level clinical data auditing, and continued human oversight when deploying general-purpose models in healthcare settings.
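The abstract reports each model's female-assignment rate with a 95% confidence interval (e.g. ChatGPT at 70%, CI 0.66-0.75). The paper does not specify which interval method or run counts were used; as a rough sketch, the Wilson score interval applied to hypothetical counts (350 female assignments out of 500 runs, i.e. 70%) yields bounds of similar width:

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion.

    More reliable than the normal approximation for proportions
    near 0 or 1, and standard for rates estimated from repeated runs.
    """
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Hypothetical run counts (not from the paper): 350 female of 500 total
lo, hi = wilson_ci(350, 500)
print(f"female-assignment rate 70%, 95% CI {lo:.2f}-{hi:.2f}")
```

With these assumed counts the interval spans roughly 0.66-0.74, comparable to the reported 0.66-0.75; the exact bounds depend on the true number of runs per model, which the abstract does not state.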
Problem

Research questions and friction points this paper is trying to address.

sex bias
clinical reasoning
large language models
healthcare AI
diagnostic bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

sex bias
clinical reasoning
large language models
healthcare AI
algorithmic fairness
Isabel Tsintsiper
Oxford Digital Health Labs, Nuffield Department of Women’s and Reproductive Health, University of Oxford, Oxford, UK
Sheng Wong
Oxford Digital Health Labs, Nuffield Department of Women’s and Reproductive Health, University of Oxford, Oxford, UK
B. Albert
Oxford Digital Health Labs, Nuffield Department of Women’s and Reproductive Health, University of Oxford, Oxford, UK
S. P. Brennecke
Pregnancy Research Centre, Department of Maternal Fetal Medicine, Royal Women’s Hospital, Victoria, Australia
Gabriel Davis Jones
University of Oxford
Maternal and Neonatal Health · Neuroscience · Computer Science · Artificial Intelligence · Global Health