🤖 AI Summary
This work addresses the limited realism in current robustness evaluations of large audio models, which often rely on simplified noise that fails to capture the complexity of real-world acoustic environments. To bridge this gap, the authors propose RSA-Bench, a high-fidelity auditory scene simulation benchmark that naturally overlays authentic soundscapes—such as those from ranches, extreme weather, and classrooms—onto clean speech, thereby constructing a multi-source, multi-level interference evaluation framework. Through six tasks spanning speech recognition, semantic understanding, and reasoning, the study systematically uncovers three critical phenomena: a perception–cognition gap (robust performance on low-level tasks but collapse in high-level reasoning), scene sensitivity (human-like acoustic interference being more detrimental than mechanical noise), and a denoising paradox (standard denoising techniques degrading performance by introducing semantic distortions). This approach transcends conventional single-source or Gaussian noise evaluation paradigms.
📝 Abstract
While Audio Large Models (ALMs) have achieved remarkable proficiency, they remain brittle in real-world deployment. Existing evaluations largely rely on synthetic Gaussian noise or simplistic single-source interference, failing to capture the intricate, multi-layered acoustic dynamics -- or ``Acoustic Ecology'' -- that characterize authentic physical environments. To bridge this ecological gap, we introduce \textbf{RSA-Bench}, a comprehensive robustness benchmark designed to stress-test ALMs through high-fidelity auditory scene simulations. Unlike traditional methods, we construct evaluation samples by naturally superimposing diverse environmental soundscapes -- spanning \textit{Pasture}, \textit{Extreme Weather}, \textit{Classroom}, and \textit{Outdoors} -- onto clean speech signals across a spectrum of interference intensities. By evaluating models on six core tasks ranging from fundamental perception to complex reasoning, our study unveils three macro-level insights: \textbf{(I) The Perception-Cognition Gap:} models remain relatively resilient on low-level recognition but suffer a \textbf{functional collapse} on high-order reasoning tasks under stress; \textbf{(II) Scenario Sensitivity:} ``vocal-like'' interference (e.g., background laughter) proves significantly more destructive than mechanical noise, challenging the models' auditory attention mechanisms; and \textbf{(III) The Denoising Paradox:} standard speech enhancement often exacerbates performance degradation, as ALMs prove highly sensitive to the semantic distortions introduced by denoising artifacts.
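The abstract describes superimposing environmental soundscapes onto clean speech "across a spectrum of interference intensities." A standard way to control such intensity is to scale the noise so the mixture hits a target signal-to-noise ratio (SNR) in dB. The sketch below illustrates that generic technique only; it is not the paper's actual pipeline, and the function name `mix_at_snr` is a hypothetical helper.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay `noise` onto `speech` at a target SNR (dB).

    Generic SNR-controlled mixing sketch -- an assumption about how
    multi-level interference could be constructed, not RSA-Bench's code.
    """
    # Loop or trim the noise clip so it covers the full utterance.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    # Choose a scale so that 10 * log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent noise clips
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` (e.g., from 20 dB down to -5 dB) would then produce the kind of graded, multi-level interference the benchmark evaluates.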