MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study identifies systematic bias in audio large language models (Audio-LLMs) used for clinical decision-making, arising from paralinguistic cues such as age, gender, and affect that risk producing healthcare inequities. We develop a rigorous evaluation framework comprising 170 clinical cases and use text-to-speech synthesis to generate multidimensional voice variants. Combining chain-of-thought prompting with explicit reasoning, we quantify modality-specific bias across audio and text inputs. Experiments reveal that audio input induces up to a 35% disparity in surgical recommendations compared to text, with one leading model issuing 80% fewer surgical recommendations for audio than for identical text inputs; age-related disparities between young and elderly voices reach 12%; and affect recognition accuracy is significantly degraded. Critically, we provide the first empirical evidence that the audio modality itself, distinct from semantic content, constitutes an independent source of bias. To address this, we propose a "bias-aware architecture" design paradigm, establishing both methodological foundations and empirical validation for trustworthy medical AI.

📝 Abstract
As large language models transition from text-based interfaces to audio interactions in clinical settings, they might introduce new vulnerabilities through paralinguistic cues in audio. We evaluated these models on 170 clinical cases, each synthesized into speech from 36 distinct voice profiles spanning variations in age, gender, and emotion. Our findings reveal a severe modality bias: surgical recommendations for audio inputs varied by as much as 35% compared to identical text-based inputs, with one model providing 80% fewer recommendations. Further analysis uncovered age disparities of up to 12% between young and elderly voices, which persisted in most models despite chain-of-thought prompting. While explicit reasoning successfully eliminated gender bias, the impact of emotion was not detected due to poor recognition performance. These results demonstrate that audio LLMs are susceptible to making clinical decisions based on a patient's voice characteristics rather than medical evidence, a flaw that risks perpetuating healthcare disparities. We conclude that bias-aware architectures are essential and urgently needed before the clinical deployment of these models.
Problem

Research questions and friction points this paper is trying to address.

Audio LLMs show significant clinical decision variations based on voice characteristics
Models exhibit modality bias, with surgical recommendations differing by up to 35% between audio and text inputs
Age disparities of up to 12% persist despite chain-of-thought prompting
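The disparity figures above reduce to gaps in surgery-recommendation rates between matched conditions (audio vs. text, young vs. elderly voices). A minimal sketch of that metric, using hypothetical per-case model outputs rather than the paper's actual data:

```python
def recommendation_rate(decisions):
    """Fraction of cases where the model recommends surgery (1 = yes, 0 = no)."""
    return sum(decisions) / len(decisions) if decisions else 0.0

def disparity(group_a, group_b):
    """Absolute gap in surgery-recommendation rates between two conditions."""
    return abs(recommendation_rate(group_a) - recommendation_rate(group_b))

# Hypothetical outputs for the same clinical cases presented as text vs.
# synthesized audio (these values are illustrative, not the paper's results).
text_decisions  = [1, 1, 0, 1, 1, 0, 1, 1]   # rate = 0.750
audio_decisions = [1, 0, 0, 1, 0, 0, 1, 0]   # rate = 0.375

modality_gap = disparity(text_decisions, audio_decisions)
print(f"modality disparity: {modality_gap:.3f}")  # prints 0.375
```

The same `disparity` function applies to any paired grouping of the 170 cases, e.g. decisions for young-voice variants against elderly-voice variants of identical case text.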
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated audio LLMs using synthesized voice profiles
Uncovered modality bias in surgical recommendations
Proposed bias-aware architectures for clinical deployment
Zhi Rui Tam
NTU / Appier
natural language processing
Yun-Nung Chen
National Taiwan University, Taipei, Taiwan