🤖 AI Summary
Existing audio understanding benchmarks overlook the coexistence of, and energy disparity between, speech and non-speech components, and they fail to evaluate joint comprehension across speech, scene, and event modalities. Method: We propose SSEU-Bench, the first unified benchmark integrating energy-aware audio segmentation with multi-task evaluation, enabling both independent and joint assessment of speech, scene, and event understanding. Our approach unifies cross-modal modeling, energy-adaptive segmentation, and chain-of-thought (CoT) reasoning to systematically evaluate large audio language models (LALMs) on joint understanding tasks for the first time. Contribution/Results: Experiments reveal a significant performance drop for current LALMs on joint tasks compared to single-modality tasks; incorporating CoT substantially improves both reasoning consistency and accuracy. This work identifies critical bottlenecks in real-world audio understanding and establishes a new paradigm for complex audio semantic parsing.
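To make the energy-disparity idea concrete, the sketch below shows one way an energy-controlled speech/non-speech mixture could be constructed: scale a background track so its energy sits a chosen number of decibels below (or above) the speech. This is a minimal illustration under stated assumptions, not the benchmark's actual pipeline; the function name `mix_at_snr` and its parameters are hypothetical.

```python
# Illustrative sketch of energy-aware mixing (not SSEU-Bench's actual code).
# Assumes mono float waveforms of equal length with nonzero background energy.
import numpy as np

def mix_at_snr(speech: np.ndarray, background: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix speech with a scene/event background at a target energy ratio (dB)."""
    # Root-mean-square energy of each component.
    speech_rms = np.sqrt(np.mean(speech ** 2))
    background_rms = np.sqrt(np.mean(background ** 2))
    # Gain such that 20 * log10(speech_rms / (gain * background_rms)) == snr_db.
    gain = speech_rms / (background_rms * 10 ** (snr_db / 20))
    mixture = speech + gain * background
    # Rescale only if the sum exceeds full scale, to avoid clipping.
    peak = np.max(np.abs(mixture))
    return mixture / peak if peak > 1.0 else mixture

# Example: place the speech 5 dB above an ambient scene recording.
# mixture = mix_at_snr(speech_wav, scene_wav, snr_db=5.0)
```

Sweeping `snr_db` over a range (e.g., -5 dB to +15 dB) would yield test clips in which the speech is alternately masked by or dominant over the scene and event sounds, which is the regime the summary argues existing benchmarks ignore.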
📄 Abstract
Recently, Large Audio Language Models (LALMs) have progressed rapidly, demonstrating strong efficacy in universal audio understanding through cross-modal integration. To evaluate LALMs' audio understanding performance, researchers have proposed various benchmarks. However, key aspects of real-world interaction remain underexplored in existing benchmarks: audio signals typically contain both speech and non-speech components, and the energy levels of these components can vary significantly across scenarios. Moreover, most benchmarks do not consider the joint understanding of speech, scene, and events within the same audio clip. In this work, we introduce SSEU-Bench, the first versatile audio understanding benchmark that explicitly accounts for energy differences between speech and non-speech audio, with both independent and joint understanding settings for speech, scene, and events. Furthermore, we demonstrate that some LALMs tend to underperform on certain tasks in the joint understanding setting. To address this issue, we introduce Chain-of-Thought (CoT) prompting, which effectively improves LALMs' joint audio understanding performance by decomposing complex tasks into simpler reasoning steps.
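To illustrate the kind of Chain-of-Thought decomposition the abstract describes, here is a minimal sketch of a step-by-step prompt for joint speech, scene, and event understanding. The prompt wording, step order, and chat-message schema are assumptions for illustration; the paper's exact prompts and model interface may differ.

```python
# Hypothetical sketch of CoT-style task decomposition for joint understanding.
# The message schema below is an assumed chat-style LALM interface, not a
# specific model's API.
def build_cot_query(audio_path: str) -> list[dict]:
    steps = (
        "Reason step by step before answering:\n"
        "1. Transcribe any speech in the clip.\n"
        "2. Identify the acoustic scene (e.g., street, office, park).\n"
        "3. List the sound events you hear (e.g., car horn, dog bark).\n"
        "4. Combine steps 1-3 into a joint answer covering speech, "
        "scene, and events."
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "path": audio_path},  # hypothetical field
                {"type": "text", "text": steps},
            ],
        }
    ]
```

The design choice mirrors the abstract's claim: rather than asking one monolithic question, the prompt forces the model to resolve each modality separately before fusing them, which is where the reported consistency and accuracy gains come from.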