🤖 AI Summary
Existing audio understanding benchmarks overlook the coexistence of, and energy disparity between, speech and non-speech components, and they fail to evaluate joint comprehension across speech, scene, and event modalities. Method: We propose SSEU-Bench, the first unified benchmark integrating energy-aware audio segmentation with multi-task evaluation, enabling both independent and joint assessment of speech, scene, and event understanding. Our approach unifies cross-modal modeling, energy-adaptive segmentation, and chain-of-thought (CoT) reasoning to systematically evaluate large audio language models (LALMs) on joint understanding tasks for the first time. Contribution/Results: Experiments reveal a significant performance drop for current LALMs on joint tasks compared to single-modality tasks; incorporating CoT substantially improves both reasoning consistency and accuracy. This work identifies critical bottlenecks in real-world audio understanding and establishes a new paradigm for complex audio semantic parsing.
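To make the energy-disparity idea concrete, the sketch below shows one way an energy-controlled speech/non-speech mixture could be constructed: scale a background track so its energy sits a chosen number of decibels below (or above) the speech. This is a minimal illustration under stated assumptions, not the benchmark's actual pipeline; the function name `mix_at_snr` and its parameters are hypothetical.

```python
# Illustrative sketch of energy-aware mixing (not SSEU-Bench's actual code).
# Assumes mono float waveforms of equal length with nonzero background energy.
import numpy as np

def mix_at_snr(speech: np.ndarray, background: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix speech with a scene/event background at a target energy ratio (dB)."""
    # Root-mean-square energy of each component.
    speech_rms = np.sqrt(np.mean(speech ** 2))
    background_rms = np.sqrt(np.mean(background ** 2))
    # Gain such that 20 * log10(speech_rms / (gain * background_rms)) == snr_db.
    gain = speech_rms / (background_rms * 10 ** (snr_db / 20))
    mixture = speech + gain * background
    # Rescale only if the sum exceeds full scale, to avoid clipping.
    peak = np.max(np.abs(mixture))
    return mixture / peak if peak > 1.0 else mixture

# Example: place the speech 5 dB above an ambient scene recording.
# mixture = mix_at_snr(speech_wav, scene_wav, snr_db=5.0)
```

Sweeping `snr_db` over a range (e.g., -5 dB to +15 dB) would yield test clips in which the speech is alternately masked by or dominant over the scene and event sounds, which is the regime the summary argues existing benchmarks ignore.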
📄 Abstract
Recently, Large Audio Language Models (LALMs) have progressed rapidly, demonstrating strong efficacy in universal audio understanding through cross-modal integration. To evaluate LALMs' audio understanding performance, researchers have proposed various benchmarks. However, key aspects of real-world interaction remain underexplored in existing benchmarks: audio signals typically contain both speech and non-speech components, and the energy levels of these components can vary significantly across scenarios. Moreover, most benchmarks do not consider the joint understanding of speech, scene, and events within the same audio clip. In this work, we introduce SSEU-Bench, the first versatile audio understanding benchmark that explicitly accounts for energy differences between speech and non-speech audio, with both independent and joint understanding settings for speech, scene, and events. Furthermore, we demonstrate that some LALMs tend to underperform on certain tasks in the joint understanding setting. To address this issue, we introduce Chain-of-Thought (CoT) prompting, which effectively improves LALMs' joint audio understanding performance by decomposing complex tasks into simpler reasoning steps.
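To illustrate the kind of Chain-of-Thought decomposition the abstract describes, here is a minimal sketch of a step-by-step prompt for joint speech, scene, and event understanding. The prompt wording, step order, and chat-message schema are assumptions for illustration; the paper's exact prompts and model interface may differ.

```python
# Hypothetical sketch of CoT-style task decomposition for joint understanding.
# The message schema below is an assumed chat-style LALM interface, not a
# specific model's API.
def build_cot_query(audio_path: str) -> list[dict]:
    steps = (
        "Reason step by step before answering:\n"
        "1. Transcribe any speech in the clip.\n"
        "2. Identify the acoustic scene (e.g., street, office, park).\n"
        "3. List the sound events you hear (e.g., car horn, dog bark).\n"
        "4. Combine steps 1-3 into a joint answer covering speech, "
        "scene, and events."
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "path": audio_path},  # hypothetical field
                {"type": "text", "text": steps},
            ],
        }
    ]
```

The design choice mirrors the abstract's claim: rather than asking one monolithic question, the prompt forces the model to resolve each modality separately before fusing them, which is where the reported consistency and accuracy gains come from.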