🤖 AI Summary
Existing audio reasoning benchmarks predominantly focus on static, single-source scenarios and fail to assess models' capacity to comprehend multi-speaker interactions, dynamically evolving auditory events, and heterogeneous audio sources. To address this gap, we introduce MDAR, the first benchmark explicitly designed for multi-scene, dynamically evolving audio reasoning. MDAR encompasses diverse acoustic modalities (e.g., speech, environmental sounds) and comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning across three question types: single-choice, multiple-choice, and open-ended QA. Comprehensive evaluation of 26 state-of-the-art audio-language models reveals severe limitations: the best open-source model, Qwen2.5-Omni, achieves only 76.67% accuracy on single-choice questions, while GPT-4o Audio reaches 68.47% there but outperforms it on the harder multiple-choice and open-ended tasks; no model reaches 80% on any question type. These results expose critical deficiencies in current models' dynamic auditory understanding capabilities.
📝 Abstract
The ability to reason from audio, including speech, paralinguistic cues, environmental sounds, and music, is essential for AI agents to interact effectively in real-world scenarios. Existing benchmarks mainly focus on static or single-scene settings and do not fully capture scenarios in which multiple speakers, unfolding events, and heterogeneous audio sources interact. To address these challenges, we introduce MDAR, a benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks. MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning and spanning three question types. We benchmark 26 state-of-the-art audio language models on MDAR and observe that they exhibit limitations on complex reasoning tasks. On single-choice questions, Qwen2.5-Omni (open-source) achieves 76.67% accuracy, whereas GPT-4o Audio (closed-source) reaches 68.47%; however, GPT-4o Audio substantially outperforms Qwen2.5-Omni on the more challenging multiple-choice and open-ended tasks. Across all three question types, no model achieves 80% performance. These findings underscore the unique challenges posed by MDAR and its value as a benchmark for advancing audio reasoning research. Code and benchmark can be found at https://github.com/luckyerr/MDAR.
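Since the abstract reports accuracy separately for the three question types, the sketch below illustrates how such per-type scoring might work. The item schema, field names, and the `judge_open_ended` placeholder are illustrative assumptions, not MDAR's actual data format or evaluation code; the released pipeline is in the repository linked above.

```python
from dataclasses import dataclass

# Hypothetical item schema -- MDAR's real format may differ;
# see https://github.com/luckyerr/MDAR for the released loader.
@dataclass
class MDARItem:
    audio_path: str     # path to the associated audio clip
    question: str
    question_type: str  # "single_choice" | "multiple_choice" | "open_ended"
    answer: object      # gold option label, set of labels, or reference text

def judge_open_ended(prediction: str, reference: str) -> bool:
    # Placeholder: open-ended answers are typically graded by an LLM
    # judge or a human rather than by exact string match.
    return prediction.strip().lower() == reference.strip().lower()

def score_item(item: MDARItem, prediction) -> float:
    """Score one prediction against the gold answer.

    Single-choice: exact match on the option label.
    Multiple-choice (multi-select): exact match on the full label set,
    so partially correct selections score 0.
    """
    if item.question_type == "single_choice":
        return float(prediction == item.answer)
    if item.question_type == "multiple_choice":
        return float(set(prediction) == set(item.answer))
    return float(judge_open_ended(prediction, item.answer))

def accuracy_by_type(items, predictions):
    """Mean score per question type, as reported in the abstract."""
    totals, counts = {}, {}
    for item, pred in zip(items, predictions):
        t = item.question_type
        totals[t] = totals.get(t, 0.0) + score_item(item, pred)
        counts[t] = counts.get(t, 0) + 1
    return {t: totals[t] / counts[t] for t in totals}

# Usage on two toy items (contents invented for illustration):
items = [
    MDARItem("clip1.wav", "Who speaks first?", "single_choice", "B"),
    MDARItem("clip2.wav", "Which sounds occur?", "multiple_choice", {"A", "C"}),
]
preds = ["B", {"A"}]
print(accuracy_by_type(items, preds))  # {'single_choice': 1.0, 'multiple_choice': 0.0}
```

Note the all-or-nothing set comparison for multi-select items: it is one plausible reading of why multiple-choice scores trail single-choice scores, since partially correct selections earn no credit under exact-set matching.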