🤖 AI Summary
Existing audio reasoning benchmarks predominantly focus on static, single-source scenarios and fail to assess models' capacity to comprehend multi-speaker interactions, dynamically evolving auditory events, and heterogeneous audio sources. To address this gap, we introduce MDAR, the first benchmark explicitly designed for multi-scene, dynamically evolving audio reasoning. MDAR encompasses diverse acoustic modalities (e.g., speech, environmental sounds) and comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning across three question types: single-choice, multiple-choice, and open-ended QA. Comprehensive evaluation of 26 state-of-the-art audio-language models reveals severe limitations: the best open-source model, Qwen2.5-Omni, achieves only 76.67% accuracy on single-choice questions, while GPT-4o Audio reaches 68.47% there but outperforms it on the harder multiple-choice and open-ended tasks; no model reaches 80% on any question type. These results expose critical deficiencies in current models' dynamic auditory understanding capabilities.
📝 Abstract
The ability to reason from audio, including speech, paralinguistic cues, environmental sounds, and music, is essential for AI agents to interact effectively in real-world scenarios. Existing benchmarks mainly focus on static or single-scene settings and do not fully capture scenarios in which multiple speakers, unfolding events, and heterogeneous audio sources interact. To address these challenges, we introduce MDAR, a benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks. MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning and spanning three question types. We benchmark 26 state-of-the-art audio language models on MDAR and observe that they exhibit limitations on complex reasoning tasks. On single-choice questions, Qwen2.5-Omni (open-source) achieves 76.67% accuracy, whereas GPT-4o Audio (closed-source) reaches 68.47%; however, GPT-4o Audio substantially outperforms Qwen2.5-Omni on the more challenging multiple-choice and open-ended tasks. Across all three question types, no model achieves 80% performance. These findings underscore the unique challenges posed by MDAR and its value as a benchmark for advancing audio reasoning research. Code and benchmark can be found at https://github.com/luckyerr/MDAR.
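Since the abstract reports accuracy separately for the three question types, the sketch below illustrates how such per-type scoring might work. The item schema, field names, and the `judge_open_ended` placeholder are illustrative assumptions, not MDAR's actual data format or evaluation code; the released pipeline is in the repository linked above.

```python
from dataclasses import dataclass

# Hypothetical item schema -- MDAR's real format may differ;
# see https://github.com/luckyerr/MDAR for the released loader.
@dataclass
class MDARItem:
    audio_path: str     # path to the associated audio clip
    question: str
    question_type: str  # "single_choice" | "multiple_choice" | "open_ended"
    answer: object      # gold option label, set of labels, or reference text

def judge_open_ended(prediction: str, reference: str) -> bool:
    # Placeholder: open-ended answers are typically graded by an LLM
    # judge or a human rather than by exact string match.
    return prediction.strip().lower() == reference.strip().lower()

def score_item(item: MDARItem, prediction) -> float:
    """Score one prediction against the gold answer.

    Single-choice: exact match on the option label.
    Multiple-choice (multi-select): exact match on the full label set,
    so partially correct selections score 0.
    """
    if item.question_type == "single_choice":
        return float(prediction == item.answer)
    if item.question_type == "multiple_choice":
        return float(set(prediction) == set(item.answer))
    return float(judge_open_ended(prediction, item.answer))

def accuracy_by_type(items, predictions):
    """Mean score per question type, as reported in the abstract."""
    totals, counts = {}, {}
    for item, pred in zip(items, predictions):
        t = item.question_type
        totals[t] = totals.get(t, 0.0) + score_item(item, pred)
        counts[t] = counts.get(t, 0) + 1
    return {t: totals[t] / counts[t] for t in totals}

# Usage on two toy items (contents invented for illustration):
items = [
    MDARItem("clip1.wav", "Who speaks first?", "single_choice", "B"),
    MDARItem("clip2.wav", "Which sounds occur?", "multiple_choice", {"A", "C"}),
]
preds = ["B", {"A"}]
print(accuracy_by_type(items, preds))  # {'single_choice': 1.0, 'multiple_choice': 0.0}
```

Note the all-or-nothing set comparison for multi-select items: it is one plausible reading of why multiple-choice scores trail single-choice scores, since partially correct selections earn no credit under exact-set matching.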