🤖 AI Summary
This work addresses the limited realism in current robustness evaluations of large audio models, which often rely on simplified noise that fails to capture the complexity of real-world acoustic environments. To bridge this gap, the authors propose RSA-Bench, a high-fidelity auditory scene simulation benchmark that naturally overlays authentic soundscapes—such as those from ranches, extreme weather, and classrooms—onto clean speech, thereby constructing a multi-source, multi-level interference evaluation framework. Through six tasks spanning speech recognition, semantic understanding, and reasoning, the study systematically uncovers three critical phenomena: a perception–cognition gap (robust performance on low-level tasks but collapse in high-level reasoning), scene sensitivity (human-like acoustic interference being more detrimental than mechanical noise), and a denoising paradox (standard denoising techniques degrading performance by introducing semantic distortions). This approach transcends conventional single-source or Gaussian noise evaluation paradigms.
📝 Abstract
While Audio Large Models (ALMs) have achieved remarkable proficiency, they remain brittle in real-world deployment. Existing evaluations largely rely on synthetic Gaussian noise or simplistic single-source interference, failing to capture the intricate, multi-layered acoustic dynamics -- or ``Acoustic Ecology'' -- that characterize authentic physical environments. To bridge this ecological gap, we introduce \textbf{RSA-Bench}, a comprehensive robustness benchmark designed to stress-test ALMs through high-fidelity auditory scene simulations. Unlike traditional methods, we construct evaluation samples by naturally superimposing diverse environmental soundscapes -- spanning \textit{Pasture}, \textit{Extreme Weather}, \textit{Classroom}, and \textit{Outdoors} -- onto clean speech signals across a spectrum of interference intensities. By evaluating models on six core tasks ranging from fundamental perception to complex reasoning, our study unveils three macro-level insights: \textbf{(I) The Perception-Cognition Gap:} models remain relatively resilient on low-level recognition but suffer a \textbf{functional collapse} on high-order reasoning tasks under stress; \textbf{(II) Scenario Sensitivity:} ``vocal-like'' interference (e.g., background laughter) proves significantly more destructive than mechanical noise, challenging the models' auditory attention mechanisms; and \textbf{(III) The Denoising Paradox:} standard speech enhancement often exacerbates performance degradation, as ALMs prove highly sensitive to the semantic distortions introduced by denoising artifacts.
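The abstract describes superimposing environmental soundscapes onto clean speech "across a spectrum of interference intensities." A standard way to control such intensity is to scale the noise so the mixture hits a target signal-to-noise ratio (SNR) in dB. The sketch below illustrates that generic technique only; it is not the paper's actual pipeline, and the function name `mix_at_snr` is a hypothetical helper.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay `noise` onto `speech` at a target SNR (dB).

    Generic SNR-controlled mixing sketch -- an assumption about how
    multi-level interference could be constructed, not RSA-Bench's code.
    """
    # Loop or trim the noise clip so it covers the full utterance.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    # Choose a scale so that 10 * log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent noise clips
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` (e.g., from 20 dB down to -5 dB) would then produce the kind of graded, multi-level interference the benchmark evaluates.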