🤖 AI Summary
Current evaluations of audio understanding are largely confined to automatic speech recognition, failing to capture models’ capabilities in real-world scenarios involving background sounds, noise localization, cross-lingual speech, and non-speech content. To address this gap, this work proposes SCENEBench, the first multidimensional audio understanding benchmark tailored to practical applications such as assistive technologies and industrial noise monitoring. It encompasses four key dimensions: spatial, cross-lingual, environmental, and non-speech understanding. SCENEBench integrates both synthetic and natural audio data, employs a multi-task evaluation protocol with latency measurements, and incorporates an ecological validity verification mechanism. Evaluations of five state-of-the-art large audio-language models reveal significant performance deficiencies across multiple tasks, some even below random chance, highlighting critical shortcomings in understanding “how something is said” and non-speech auditory content, and charting a clear direction for future research.
📝 Abstract
Advances in large language models (LLMs) have brought significant capabilities to audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, little work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap with SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), a benchmark suite that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-lingual speech understanding, and vocal characterizer recognition. These four categories were selected to address understudied needs in accessibility technology and industrial noise monitoring. In addition to task performance, we also measure model latency. The purpose of this benchmark suite is to assess audio beyond just what words are said: how they are said, and the non-speech components of the audio. Because our audio samples are synthetically constructed (e.g., by overlaying two natural audio samples), we further validate the benchmark against 20 natural audio items per task, sub-sampled from existing datasets to match our task criteria, to assess ecological validity. We evaluate five state-of-the-art LALMs and find critical gaps: performance varies widely across tasks, with some falling below random chance and others achieving high accuracy. These results provide direction for targeted improvements in model capabilities.
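
The synthetic construction mentioned in the abstract is, at its core, waveform mixing. Below is a minimal sketch of how such an overlay could be produced, assuming two mono clips at the same sample rate and mixing at a target signal-to-noise ratio; the function name, file names, and the 10 dB default are illustrative assumptions, not details from the paper.

```python
import numpy as np
import soundfile as sf  # pip install soundfile

def overlay_at_snr(speech: np.ndarray, background: np.ndarray,
                   snr_db: float = 10.0) -> np.ndarray:
    """Mix a background clip under a speech clip at a target SNR (illustrative)."""
    # Loop or trim the background so both signals have the same length.
    if len(background) < len(speech):
        reps = int(np.ceil(len(speech) / len(background)))
        background = np.tile(background, reps)
    background = background[: len(speech)]

    # Scale the background so the speech-to-background power ratio equals snr_db.
    speech_power = np.mean(speech ** 2)
    background_power = np.mean(background ** 2) + 1e-12
    scale = np.sqrt(speech_power / (background_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * background

    # Normalize only if the mix would clip when written back to disk.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

# Hypothetical file names, standing in for one speech and one ambience recording.
speech, sr = sf.read("speech.wav")
background, _ = sf.read("street_noise.wav")
sf.write("mixed.wav", overlay_at_snr(speech, background, snr_db=10.0), sr)
```

The SNR parameter is the natural knob here: lowering it buries the speech deeper in the background sound, which is presumably how a benchmark of this kind would control task difficulty.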
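
The abstract also reports model latency but does not say how it is measured. One plausible sketch, under the assumption that each benchmark item is a single model call, is a wall-clock timer around that call; `query_model` is a hypothetical stand-in for whichever LALM API is under test.

```python
import time
import statistics
from typing import Callable

def measure_latency(query_model: Callable[[str, str], str],
                    audio_path: str, prompt: str, runs: int = 5) -> float:
    """Median wall-clock seconds per call; query_model is a hypothetical callable."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        query_model(audio_path, prompt)  # one benchmark item
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

Taking the median over several runs dampens one-off spikes from network or model warm-up, though the actual protocol in the paper may differ.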