MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates reasoning instability and implicit bias induced by patient pronoun variation (“he”/“she”/“they”) in large language models (LLMs) used for clinical decision support. The authors propose MEDEQUALQA, the first controlled, counterfactual evaluation framework for medical LLMs: it perturbs only pronouns while holding symptoms, diagnoses, and clinical context invariant, enabling rigorous assessment of semantic consistency in reasoning paths. Combining single critical-symptom-and-condition (CSC) ablations of clinical vignettes, GPT-4.1-generated reasoning traces, and semantic textual similarity (STS) quantification across ~69,000 test instances, the study finds overall high reasoning consistency (STS > 0.80) yet uncovers significant localized biases in risk-factor referencing, clinical guideline citation, and differential prioritization. This work establishes the first quantifiable, attribution-based analysis of pronoun-driven reasoning instability in medicine, revealing a novel implicit bias pattern, “diagnostically consistent but reasoning-divergent,” in which final diagnoses remain unchanged despite systematic shifts in the underlying clinical justification.
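The counterfactual perturbation at the core of MEDEQUALQA can be sketched as a pronoun rewrite that leaves all clinical content untouched. The snippet below is a minimal illustration, not the authors' code: the pronoun mappings and `perturb_pronouns` helper are assumptions, and singular-they verb agreement ("he reports" vs. "they report") would need extra handling in practice.

```python
import re

# Illustrative pronoun maps for the counterfactual variants (he/him -> she/her, they/them).
# The paper's exact rewriting procedure is not specified here.
PRONOUN_MAPS = {
    "she": {"he": "she", "him": "her", "his": "her", "himself": "herself"},
    "they": {"he": "they", "him": "them", "his": "their", "himself": "themself"},
}

def perturb_pronouns(vignette: str, target: str) -> str:
    """Rewrite a he/him vignette into the target pronoun variant,
    leaving symptoms, diagnoses, and clinical context invariant."""
    mapping = PRONOUN_MAPS[target]
    # \b word boundaries keep substrings like "history" or "theme" intact.
    pattern = re.compile(r"\b(" + "|".join(mapping) + r")\b", re.IGNORECASE)

    def swap(match: re.Match) -> str:
        word = match.group(0)
        repl = mapping[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl

    return pattern.sub(swap, vignette)

base = "He reports chest pain; his ECG is normal and he denies dyspnea."
print(perturb_pronouns(base, "she"))
# She reports chest pain; her ECG is normal and she denies dyspnea.
```

Because only the pronoun tokens change, any divergence between the model's reasoning traces on the two variants can be attributed to the demographic cue rather than the clinical content.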

📝 Abstract
Large language models (LLMs) are increasingly deployed in clinical decision support, yet subtle demographic cues can influence their reasoning. Prior work has documented disparities in outputs across patient groups, but little is known about how internal reasoning shifts under controlled demographic changes. We introduce MEDEQUALQA, a counterfactual benchmark that perturbs only patient pronouns (he/him, she/her, they/them) while holding critical symptoms and conditions (CSCs) constant. Each clinical vignette is expanded into single-CSC ablations, producing three parallel datasets of approximately 23,000 items each (69,000 total). We evaluate a GPT-4.1 model and compute Semantic Textual Similarity (STS) between reasoning traces to measure stability across pronoun variants. Our results show overall high similarity (mean STS >0.80), but reveal consistent localized divergences in cited risk factors, guideline anchors, and differential ordering, even when final diagnoses remain unchanged. Our error analysis highlights certain cases in which the reasoning shifts, underscoring clinically relevant bias loci that may cascade into inequitable care. MEDEQUALQA offers a controlled diagnostic setting for auditing reasoning stability in medical AI.
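The stability measurement described in the abstract, scoring the similarity of reasoning traces across pronoun variants, can be sketched as follows. The paper presumably uses an embedding-based STS model; the bag-of-words cosine below is a stdlib-only stand-in, and the `cosine_sts` helper and example traces are illustrative assumptions.

```python
import math
from collections import Counter

def cosine_sts(trace_a: str, trace_b: str) -> float:
    """Bag-of-words cosine similarity between two reasoning traces.
    A stand-in for embedding-based STS; scores fall in [0, 1]."""
    va, vb = Counter(trace_a.lower().split()), Counter(trace_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Hypothetical traces: same diagnosis, but the cited risk factor shifts
# with the pronoun variant -- the "localized divergence" the paper reports.
trace_he = "chest pain with risk factors of smoking and hypertension suggests acs"
trace_she = "chest pain with risk factors of anxiety and hypertension suggests acs"
print(round(cosine_sts(trace_he, trace_she), 2))
# 0.91
```

A high aggregate score (here > 0.80) with a single swapped risk-factor token illustrates the paper's central finding: mean STS can look reassuring while clinically meaningful parts of the justification still diverge, so localized error analysis is needed on top of the aggregate metric.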
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM reasoning biases through controlled pronoun perturbations
Measuring reasoning stability across demographic variations in clinical vignettes
Identifying clinically relevant bias loci that may affect equitable care
Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual benchmark perturbs patient pronouns only
Evaluates reasoning stability with semantic textual similarity
Identifies bias loci through controlled demographic changes