🤖 AI Summary
Existing audio benchmarks predominantly test semantics that can be recovered from text captions, neglecting fine-grained perception and spatiotemporal reasoning about how sound varies across time and 3D space. Method: This work formalizes the concept of "audio 4D intelligence" and proposes STAR-Bench, a benchmark that combines foundational acoustic perception tasks built from procedurally synthesized and physics-simulated audio with holistic spatio-temporal reasoning tasks curated through a four-stage pipeline involving human annotation and final selection based on human performance. The tasks span absolute/relative attribute judgment, audio segment reordering, static localization, multi-source relations, and dynamic trajectory prediction. Contribution/Results: Replacing the audio input with a caption alone lowers accuracy by 31.5% (temporal) and 35.2% (spatial), showing that STAR-Bench targets non-linguistic acoustic cues that are hard to describe in text. Evaluating 19 leading closed- and open-source models reveals substantial gaps relative to humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained auditory perception, while open-source models lag across perception, knowledge, and reasoning.
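The caption-only drop reported above can be understood as a simple ablation: score each model once with the raw waveform and once with a text caption substituted for it, then compare. The sketch below illustrates this under assumed names (`model.answer`, the item fields); it is not the authors' evaluation code.

```python
# Hypothetical sketch of a caption-only ablation; the scoring interface
# and item fields are assumptions, not the STAR-Bench evaluation code.
def accuracy(model, items, caption_only: bool) -> float:
    """Fraction of benchmark items answered correctly.

    Each item is assumed to carry an audio clip, a text caption,
    a question, and a gold answer.
    """
    correct = 0
    for item in items:
        context = item["caption"] if caption_only else item["audio"]
        answer = model.answer(context, item["question"])  # assumed API
        correct += int(answer == item["gold"])
    return correct / len(items)

def caption_only_drop(model, items) -> float:
    """Accuracy lost when the waveform is replaced by its caption."""
    return 100 * (accuracy(model, items, False) - accuracy(model, items, True))
```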
📝 Abstract
Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples: foundational tasks use procedurally synthesized and physics-simulated audio, while holistic data follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks, where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on cues that are hard to describe linguistically. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. STAR-Bench thus provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
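For intuition about the procedurally synthesized stimuli used in the foundational perception setting, the following minimal sketch builds a relative-pitch item: two pure tones that differ only in pitch, concatenated with a short gap, with the gold answer being which segment is higher. All parameters and the two-semitone interval are illustrative assumptions, not the benchmark's generation recipe.

```python
# Illustrative procedural synthesis of a relative-pitch stimulus pair;
# parameters are assumptions, not the STAR-Bench generation recipe.
import numpy as np

SR = 16_000  # sample rate in Hz

def tone(freq_hz: float, dur_s: float = 1.0, amp: float = 0.3) -> np.ndarray:
    """Pure sine tone with 10 ms fades to avoid onset/offset clicks."""
    t = np.arange(int(SR * dur_s)) / SR
    x = amp * np.sin(2 * np.pi * freq_hz * t)
    fade = np.minimum(1.0, np.minimum(t, dur_s - t) / 0.01)
    return x * fade

# A relative-attribute item: which segment is higher in pitch?
reference = tone(440.0)                    # segment A
comparison = tone(440.0 * 2 ** (2 / 12))   # segment B, two semitones higher
clip = np.concatenate([reference, np.zeros(SR // 4), comparison])
label = "B"  # gold answer: the second segment is higher
```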