🤖 AI Summary
While large language models (LLMs) achieve expert-level performance on medical diagnosis questions, their systematic biases across race, sex, and socioeconomic status pose serious patient safety risks, and existing evaluation benchmarks lack the automation, scalability, and clinical context alignment needed to measure them. Method: We introduce AMQA, the first adversarial bias benchmark for medical question answering, comprising 4,806 high-quality adversarial QA pairs derived from United States Medical Licensing Examination (USMLE) questions via a multi-agent generation framework that produces diverse adversarial patient descriptions. Contribution/Results: Benchmarking five representative LLMs reveals accuracy disparities of more than 10 percentage points between privileged and unprivileged demographic groups; even GPT-4.1, the least biased model tested, exhibits this gap. On average, AMQA surfaces 15% larger accuracy gaps than the existing CPV benchmark. All data and code are publicly released to advance trustworthy, equitable AI in healthcare.
📝 Abstract
Large language models (LLMs) are reaching expert-level accuracy on medical diagnosis questions, yet their mistakes and the biases behind them pose life-critical risks. Bias linked to race, sex, and socioeconomic status is already well documented, but a consistent, automated testbed for measuring it is missing. To fill this gap, this paper presents AMQA -- an Adversarial Medical Question-Answering dataset -- built for automated, large-scale bias evaluation of LLMs in medical QA. AMQA comprises 4,806 medical QA pairs derived from the United States Medical Licensing Examination (USMLE) dataset, with a multi-agent framework generating diverse adversarial patient descriptions and question-pair variants. Using AMQA, we benchmark five representative LLMs and find surprisingly substantial disparities: even GPT-4.1, the least biased model tested, answers privileged-group questions over 10 percentage points more accurately than unprivileged ones. Compared with the existing benchmark CPV, AMQA reveals 15% larger accuracy gaps on average between privileged and unprivileged groups. Our dataset and code are publicly available at https://github.com/XY-Showing/AMQA to support reproducible research and advance trustworthy, bias-aware medical AI.
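The headline metric, the percentage-point accuracy gap between privileged and unprivileged variants of the same question, can be sketched as follows. This is an illustrative sketch only: the record format and function names are assumptions for exposition, not the released AMQA evaluation code.

```python
# Illustrative sketch (hypothetical data format, not the AMQA codebase):
# each record pairs a model's correctness on the "privileged" and
# "unprivileged" adversarial variants of one underlying USMLE-style question.

def group_accuracy(results, key):
    """Fraction of questions answered correctly for one demographic variant."""
    return sum(r[key] for r in results) / len(results)

def accuracy_gap(results):
    """Percentage-point gap: privileged accuracy minus unprivileged accuracy."""
    return 100 * (group_accuracy(results, "privileged_correct")
                  - group_accuracy(results, "unprivileged_correct"))

# Toy example: booleans mark whether the model answered each variant correctly.
toy = [
    {"privileged_correct": True,  "unprivileged_correct": True},
    {"privileged_correct": True,  "unprivileged_correct": False},
    {"privileged_correct": True,  "unprivileged_correct": False},
    {"privileged_correct": False, "unprivileged_correct": False},
]
print(accuracy_gap(toy))  # 50.0 on this toy data (75% vs 25%)
```

A gap above 10 on real data would correspond to the over-10-percentage-point disparity the abstract reports for GPT-4.1.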