🤖 AI Summary
While large language models (LLMs) achieve expert-level performance on medical diagnosis questions, their systematic biases across race, sex, and socioeconomic status pose serious patient safety risks, and existing evaluation benchmarks lack the automation, scalability, and clinical context alignment needed to measure them. Method: We introduce AMQA, the first adversarial bias benchmark for medical question answering, comprising 4,806 high-quality adversarial QA pairs derived from United States Medical Licensing Examination (USMLE) questions via a multi-agent generation framework that produces diverse adversarial patient descriptions. Contribution/Results: Benchmarking five representative LLMs reveals accuracy disparities of more than 10 percentage points between privileged and unprivileged demographic groups; even GPT-4.1, the least biased model tested, exhibits this gap. On average, AMQA surfaces 15% larger accuracy gaps than the existing CPV benchmark. All data and code are publicly released to advance trustworthy, equitable AI in healthcare.
📝 Abstract
Large language models (LLMs) are reaching expert-level accuracy on medical diagnosis questions, yet their mistakes and the biases behind them pose life-critical risks. Bias linked to race, sex, and socioeconomic status is already well documented, but a consistent, automated testbed for measuring it is missing. To fill this gap, this paper presents AMQA -- an Adversarial Medical Question-Answering dataset -- built for automated, large-scale bias evaluation of LLMs in medical QA. AMQA comprises 4,806 medical QA pairs derived from the United States Medical Licensing Examination (USMLE) dataset, with a multi-agent framework generating diverse adversarial patient descriptions and question-pair variants. Using AMQA, we benchmark five representative LLMs and find surprisingly substantial disparities: even GPT-4.1, the least biased model tested, answers privileged-group questions over 10 percentage points more accurately than unprivileged ones. Compared with the existing benchmark CPV, AMQA reveals 15% larger accuracy gaps on average between privileged and unprivileged groups. Our dataset and code are publicly available at https://github.com/XY-Showing/AMQA to support reproducible research and advance trustworthy, bias-aware medical AI.
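The headline metric, the percentage-point accuracy gap between privileged and unprivileged variants of the same question, can be sketched as follows. This is an illustrative sketch only: the record format and function names are assumptions for exposition, not the released AMQA evaluation code.

```python
# Illustrative sketch (hypothetical data format, not the AMQA codebase):
# each record pairs a model's correctness on the "privileged" and
# "unprivileged" adversarial variants of one underlying USMLE-style question.

def group_accuracy(results, key):
    """Fraction of questions answered correctly for one demographic variant."""
    return sum(r[key] for r in results) / len(results)

def accuracy_gap(results):
    """Percentage-point gap: privileged accuracy minus unprivileged accuracy."""
    return 100 * (group_accuracy(results, "privileged_correct")
                  - group_accuracy(results, "unprivileged_correct"))

# Toy example: booleans mark whether the model answered each variant correctly.
toy = [
    {"privileged_correct": True,  "unprivileged_correct": True},
    {"privileged_correct": True,  "unprivileged_correct": False},
    {"privileged_correct": True,  "unprivileged_correct": False},
    {"privileged_correct": False, "unprivileged_correct": False},
]
print(accuracy_gap(toy))  # 50.0 on this toy data (75% vs 25%)
```

A gap above 10 on real data would correspond to the over-10-percentage-point disparity the abstract reports for GPT-4.1.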