Evaluating the Presence of Sex Bias in Clinical Reasoning by Large Language Models

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates implicit gender bias in large language models (LLMs) during clinical reasoning, which may exacerbate healthcare disparities. Using 50 multi-specialty clinical cases—crafted by physicians and explicitly designed to be gender-agnostic—the authors systematically evaluate the inherent gender assignment tendencies of four leading LLMs under conditions devoid of explicit gender cues. Through a novel integration of controlled experiments, temperature modulation, abstention mechanisms, and confidence interval analysis, the work quantifies for the first time how model configurations influence bias manifestation. Results reveal that all evaluated models exhibit significant and consistent gender biases: ChatGPT, DeepSeek, and Claude predominantly assign female gender, whereas Gemini shows a male bias. Notably, these disparities persist even when models are permitted to abstain from gender assignment, indicating a robust and concerning influence of implicit bias on diagnostic reasoning.

📝 Abstract
Large language models (LLMs) are increasingly embedded in healthcare workflows for documentation, education, and clinical decision support. However, these systems are trained on large text corpora that encode existing biases, including sex disparities in diagnosis and treatment, raising concerns that such patterns may be reproduced or amplified. We systematically examined whether contemporary LLMs exhibit sex-specific biases in clinical reasoning and how model configuration influences these behaviours. We conducted three experiments using 50 clinician-authored vignettes spanning 44 specialties in which sex was non-informative to the initial diagnostic pathway. Four general-purpose LLMs were evaluated: ChatGPT (gpt-4o-mini), Claude 3.7 Sonnet, Gemini 2.0 Flash and DeepSeek-chat. All models demonstrated significant sex-assignment skew, with the predicted sex differing by model. At temperature 0.5, ChatGPT assigned female sex in 70% of cases (95% CI 0.66-0.75), DeepSeek in 61% (0.57-0.65) and Claude in 59% (0.55-0.63), whereas Gemini showed a male skew, assigning female sex in only 36% of cases (0.32-0.41). Contemporary LLMs exhibit stable, model-specific sex biases in clinical reasoning. Permitting abstention reduces explicit labelling but does not eliminate downstream diagnostic differences. Safe clinical integration requires conservative and documented configuration, specialty-level clinical data auditing, and continued human oversight when deploying general-purpose models in healthcare settings.
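The abstract reports each model's female-assignment rate with a 95% confidence interval (e.g. ChatGPT at 70%, CI 0.66-0.75). The paper does not specify which interval method or run counts were used; as a rough sketch, the Wilson score interval applied to hypothetical counts (350 female assignments out of 500 runs, i.e. 70%) yields bounds of similar width:

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion.

    More reliable than the normal approximation for proportions
    near 0 or 1, and standard for rates estimated from repeated runs.
    """
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Hypothetical run counts (not from the paper): 350 female of 500 total
lo, hi = wilson_ci(350, 500)
print(f"female-assignment rate 70%, 95% CI {lo:.2f}-{hi:.2f}")
```

With these assumed counts the interval spans roughly 0.66-0.74, comparable to the reported 0.66-0.75; the exact bounds depend on the true number of runs per model, which the abstract does not state.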
Problem

Research questions and friction points this paper is trying to address.

sex bias
clinical reasoning
large language models
healthcare AI
diagnostic bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

sex bias
clinical reasoning
large language models
healthcare AI
algorithmic fairness
Isabel Tsintsiper
Oxford Digital Health Labs, Nuffield Department of Women’s and Reproductive Health, University of Oxford, Oxford, UK
Sheng Wong
Oxford Digital Health Labs, Nuffield Department of Women’s and Reproductive Health, University of Oxford, Oxford, UK
B. Albert
Oxford Digital Health Labs, Nuffield Department of Women’s and Reproductive Health, University of Oxford, Oxford, UK
S. P. Brennecke
Pregnancy Research Centre, Department of Maternal Fetal Medicine, Royal Women’s Hospital, Victoria, Australia
Gabriel Davis Jones
University of Oxford
Maternal and Neonatal Health · Neuroscience · Computer Science · Artificial Intelligence · Global Health