When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large audio-language models (LALMs) exhibit pervasive text bias when their audio and text inputs conflict, which degrades audio understanding and undermines the reliability of multimodal systems. To study this, the paper introduces MCR-BENCH, the first benchmark designed to evaluate modality priority, systematically uncovering and quantifying text bias in LALMs. By combining large-scale audio-comprehension evaluation, supervised fine-tuning, and confidence-calibration analysis, the authors trace the origins of the bias and the mechanisms behind model overconfidence. Experiments show that mainstream LALMs largely ignore audio evidence and rely almost exclusively on textual cues for predictions; fine-tuning mitigates the bias to some extent, but erroneous outputs remain highly confident. The work provides a reproducible benchmark for modality alignment, insight into unreliable audio grounding, and empirically grounded directions toward trustworthy audio-language understanding.
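
One way to make the overconfidence finding concrete is to score generated answers with a token-probability proxy. The snippet below is a minimal sketch under that assumption; the paper's calibration analysis may use a different confidence measure, and the log-probabilities shown are made-up values for illustration.

```python
import math

def answer_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability of a generated answer -- a crude
    confidence proxy. Under conflicting inputs, overconfidence shows up as
    high values on wrong, text-following answers."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Example: a confidently wrong answer vs. a hesitant one (made-up logprobs).
print(answer_confidence([-0.05, -0.02, -0.10]))  # ~0.94: highly confident
print(answer_confidence([-1.2, -0.9, -1.5]))     # ~0.30: uncertain
```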

📝 Abstract
Large Audio-Language Models (LALMs) are language models enhanced with audio perception capabilities, enabling them to effectively process and understand multimodal inputs that combine audio and text. However, their performance in handling conflicting information between the audio and text modalities remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the factors that influence text bias, explore mitigation strategies through supervised fine-tuning, and analyze model confidence patterns, which reveal persistent overconfidence even with contradictory inputs. These findings underscore the need for improved modality balance during training and more sophisticated fusion mechanisms to enhance robustness when handling conflicting multimodal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.
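
To make the benchmark's setup concrete, here is a minimal, hypothetical sketch of a modality-priority protocol: each example pairs an audio clip (the ground truth) with a contradictory textual claim, and the model's answer is classified by which modality it follows. The `ConflictExample` fields and the `query_lalm` stub are illustrative assumptions, not the paper's actual code or API.

```python
from dataclasses import dataclass

@dataclass
class ConflictExample:
    """One audio-text conflict pair: the audio carries the ground truth,
    while the accompanying text asserts something contradictory."""
    audio_path: str        # clip whose content defines the correct answer
    conflicting_text: str  # textual claim that contradicts the audio
    question: str          # question posed to the model
    audio_answer: str      # answer supported by the audio
    text_answer: str       # answer implied by the (incorrect) text

def query_lalm(audio_path: str, prompt: str) -> str:
    """Placeholder for a real LALM call; replace with your model's
    inference function (hypothetical, not the paper's API)."""
    raise NotImplementedError

def modality_priority(examples: list[ConflictExample]) -> dict[str, float]:
    """Classify each prediction as audio-following, text-following, or
    neither, and return the fraction of each."""
    counts = {"audio": 0, "text": 0, "other": 0}
    for ex in examples:
        prompt = f"{ex.conflicting_text}\n{ex.question}"
        pred = query_lalm(ex.audio_path, prompt).lower()
        if ex.audio_answer.lower() in pred:
            counts["audio"] += 1
        elif ex.text_answer.lower() in pred:
            counts["text"] += 1
        else:
            counts["other"] += 1
    n = len(examples) or 1
    return {key: val / n for key, val in counts.items()}
```

A high `text` fraction corresponds to the text bias the paper reports; the paper's actual scoring may differ in detail.
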
Problem

Research questions and friction points this paper is trying to address.

Evaluating text bias in audio-language models with conflicting inputs
Assessing performance degradation in audio-centric tasks due to bias
Investigating mitigation strategies for multimodal information conflicts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed MCR-BENCH benchmark for audio-text conflicts
Identified text bias in multimodal models through evaluation
Explored supervised fine-tuning for mitigating modality imbalance (a data-construction sketch follows this list)
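
As referenced in the last item, one plausible way to build such fine-tuning data is to supervise the model toward the audio-grounded answer even when the prompt contains a contradictory textual claim. The sketch below writes such records in a generic JSONL format; the field names and file layout are assumptions for illustration, not the paper's released pipeline.

```python
import json

def build_sft_record(audio_path: str, conflicting_text: str,
                     question: str, audio_answer: str) -> dict:
    """Hypothetical SFT record: the target answer follows the audio even
    though the prompt carries a contradictory textual claim, training the
    model to ground its answers in audio evidence."""
    return {
        "audio": audio_path,
        "prompt": f"{conflicting_text}\n{question}",
        "response": audio_answer,  # supervise toward the audio evidence
    }

def write_sft_dataset(records: list[dict], path: str = "mcr_sft.jsonl") -> None:
    """Write records in the JSONL format commonly consumed by SFT trainers."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```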