RSA-Bench: Benchmarking Audio Large Models in Real-World Acoustic Scenarios

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited realism in current robustness evaluations of large audio models, which often rely on simplified noise that fails to capture the complexity of real-world acoustic environments. To bridge this gap, the authors propose RSA-Bench, a high-fidelity auditory scene simulation benchmark that naturally overlays authentic soundscapes—such as those from ranches, extreme weather, and classrooms—onto clean speech, thereby constructing a multi-source, multi-level interference evaluation framework. Through six tasks spanning speech recognition, semantic understanding, and reasoning, the study systematically uncovers three critical phenomena: a perception–cognition gap (robust performance on low-level tasks but collapse in high-level reasoning), scene sensitivity (human-like acoustic interference being more detrimental than mechanical noise), and a denoising paradox (standard denoising techniques degrading performance by introducing semantic distortions). This approach transcends conventional single-source or Gaussian noise evaluation paradigms.
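The scene-superposition step described above, overlaying a soundscape onto clean speech at a controlled interference level, can be sketched as follows. This is a minimal illustration only: the function name, the SNR-based scaling, and the looping of short noise clips are assumptions, not the authors' documented pipeline.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay a background soundscape onto clean speech at a target SNR (dB).

    Illustrative sketch of multi-level interference construction;
    details of the real RSA-Bench pipeline are not specified here.
    """
    # Loop or trim the noise clip to match the speech length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    # Scale the noise so the speech-to-noise power ratio equals snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent clips
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` (e.g. from 20 dB down to 0 dB) would yield the "spectrum of interference intensities" the benchmark evaluates across.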

📝 Abstract
While Audio Large Models (ALMs) have achieved remarkable proficiency, their robustness remains brittle in real-world deployment. Existing evaluations largely rely on synthetic Gaussian noise or simplistic single-source interference, failing to capture the intricate, multi-layered acoustic dynamics -- or ``Acoustic Ecology'' -- that characterize authentic physical environments. To bridge this ecological gap, we introduce \textbf{RSA-Bench}, a comprehensive robustness benchmark designed to stress-test ALMs through high-fidelity auditory scene simulations. Unlike traditional methods, we construct evaluation samples by naturally superimposing diverse environmental soundscapes -- spanning \textit{Pasture}, \textit{Extreme Weather}, \textit{Classroom}, and \textit{Outdoors} -- onto clean speech signals across a spectrum of interference intensities. By evaluating models on six core tasks ranging from fundamental perception to complex reasoning, our study unveils three macro-level insights: \textbf{(I) The Perception-Cognition Gap:} Models maintain relative resilience in low-level recognition but suffer a \textbf{functional collapse} in high-order reasoning tasks under stress; \textbf{(II) Scenario Sensitivity:} ``Vocal-like'' interference (e.g., background laughter) proves significantly more destructive than mechanical noise, challenging the model's auditory attention mechanisms; and \textbf{(III) The Denoising Paradox:} Standard speech enhancement often exacerbates performance degradation, as ALMs prove highly sensitive to the semantic distortions introduced by denoising artifacts.
Problem

Research questions and friction points this paper is trying to address.

Audio Large Models
Robustness
Acoustic Ecology
Real-World Scenarios
Auditory Scene Simulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio Large Models
Robustness Benchmark
Acoustic Ecology
Auditory Scene Simulation
Denoising Paradox
Yibo Zhang
Beijing University of Posts and Telecommunications

Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University
Embodied AI, Causal Inference and Learning, Multimodal Data Analysis

Kaiwen Luo
Nanyang Technological University

Shilinlu Yan
Beijing University of Posts and Telecommunications

Jin Wang
Xidian University

Yaoqi Guo
Peking University
Software Engineering

Yitian Chen
Algorithm scientist, Cardinal Operations
AI, deep learning

Yalan Qin
Shanghai University

Zhenhong Zhou
Nanyang Technological University
Large Language Model, AI Safety, LLM Safety

Kun Wang
Singapore University of Technology and Design
Deep Learning, Computer Vision

Li Sun
Beijing University of Posts and Telecommunications