MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

📅 2025-05-19
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing audio-language model (ALM) benchmarks suffer from limited modality coverage, narrow domain scope, and shallow reasoning depth, focusing primarily on single-modality, single-domain recognition tasks. Method: We introduce MMAR, a multimodal audio-language benchmark designed for deep reasoning. MMAR comprises 1,000 real-world, video-derived audio-question-answer triplets and defines a four-layer reasoning framework (Signal, Perception, Semantic, and Cultural) that integrates heterogeneous audio modalities (speech, music, environmental sounds) and interdisciplinary knowledge (e.g., auditory cognition). It features comprehensive chain-of-thought (CoT) annotations and employs web-video-driven data collection, iterative error correction, and hierarchical task design. Contribution/Results: Experiments reveal that state-of-the-art ALMs achieve less than 35% accuracy on the Perception and Cultural reasoning layers, exposing critical bottlenecks in deep audio understanding. MMAR establishes a standardized, interpretable, and reproducible evaluation platform to advance ALMs beyond recognition toward principled, multi-level reasoning.
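The four-layer framework and per-item CoT annotations suggest a simple record structure for each benchmark entry. The sketch below is a hypothetical Python illustration; the class and field names are assumptions for exposition, not the released MMAR schema.

```python
from dataclasses import dataclass
from enum import Enum


class ReasoningLayer(Enum):
    """The four reasoning layers MMAR defines, from low-level to high-level."""
    SIGNAL = "signal"
    PERCEPTION = "perception"
    SEMANTIC = "semantic"
    CULTURAL = "cultural"


@dataclass
class MMARItem:
    """One audio-question-answer triplet (hypothetical field names)."""
    audio_path: str          # audio clip sourced from a real-world web video
    question: str
    choices: list[str]       # candidate answers
    answer: str              # ground-truth choice
    layer: ReasoningLayer    # top-level reasoning layer
    sub_category: str        # finer-grained task type within the layer
    modalities: set[str]     # e.g., {"speech", "music"} for mixed audio
    cot_rationale: str       # annotated chain-of-thought explanation
```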

📝 Abstract
We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error correction and quality checks. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends coverage to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a subset of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations in the understanding and reasoning capabilities of current models. We hope MMAR will serve as a catalyst for future advances in this important but underexplored area.
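Because every question carries a layer tag, per-layer accuracy is the natural way to surface where models break down (e.g., the sub-35% Perception and Cultural results noted in the summary above). A minimal evaluation sketch follows, reusing the hypothetical MMARItem record from earlier; `predict` is a stand-in for whichever LALM/LARM/OLM interface is under test, not the paper's released harness.

```python
from collections import Counter


def per_layer_accuracy(items, predict):
    """Score a model separately on each reasoning layer.

    `items` is an iterable of MMARItem-like records; `predict` is a callable
    (audio_path, question, choices) -> chosen answer string. Both are
    hypothetical stand-ins, not MMAR's official evaluation interface.
    """
    correct, total = Counter(), Counter()
    for item in items:
        layer = item.layer.value
        total[layer] += 1
        if predict(item.audio_path, item.question, item.choices) == item.answer:
            correct[layer] += 1
    # Report accuracy per layer so weaknesses in, say, Cultural reasoning
    # are not averaged away by strong Signal-layer performance.
    return {layer: correct[layer] / total[layer] for layer in total}
```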
Problem

Research questions and friction points this paper is trying to address.

Evaluates deep reasoning in multi-disciplinary audio-language tasks
Covers diverse real-world audio scenarios beyond single domains
Requires multi-step reasoning and advanced domain-specific knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical reasoning layers for diverse audio tasks
Chain-of-Thought annotations to enhance audio reasoning
Multi-disciplinary benchmark with mixed-modality audio scenarios
👥 Authors

Ziyang Ma
Nanyang Technological University

Yinghao Ma
PhD candidate, Centre for Digital Music (C4DM), Queen Mary University of London
Music Information Retrieval · Large Language Models · Multimodal Learning

Yanqiao Zhu
Shanghai Jiao Tong University

Chen Yang
Shanghai Jiao Tong University

Yi-Wen Chao
Nanyang Technological University

Ruiyang Xu
Shanghai Jiao Tong University

Wenxi Chen
Shanghai Jiao Tong University

Yuanzhe Chen
ByteDance

Zhuo Chen
ByteDance

Jian Cong
ByteDance Seed
Speech

Kai Li
Tsinghua University

Keliang Li
University of Chinese Academy of Sciences

Siyou Li
Queen Mary University of London

Xinfeng Li
Nanyang Technological University

Xiquan Li
Shanghai Jiao Tong University
Audio Understanding · Audio Generation · Large Language Models

Zheng Lian
Associate Professor, IEEE/CCF Senior Member, Institute of Automation, Chinese Academy of Sciences
Affective Computing · Sentiment Analysis · Machine Learning

Yuzhe Liang
Shanghai Jiao Tong University
Deep Learning · Multimodal Learning

Minghao Liu
2077AI

Zhikang Niu
Shanghai Jiao Tong University
Speech Synthesis

Tianrui Wang
Tianjin University
Speech Signal Processing

Yuping Wang
ByteDance

Yuxuan Wang
ByteDance

Yihao Wu
Nanyang Technological University

Guanrou Yang
Shanghai Jiao Tong University

Jianwei Yu
Tencent AI Lab
ASR

Ruibin Yuan
HKUST
Artificial Intelligence · Music Generation · Music Information Retrieval · Computer Music

Zhisheng Zheng
The University of Texas at Austin
Speech and Language Processing · Natural Language Processing · Multimodal Learning

Ziya Zhou
The Hong Kong University of Science and Technology
Music Technology · Natural Language Processing

Haina Zhu
Shanghai Jiao Tong University
Music Generation · Self-Supervised Learning · Deep Reinforcement Learning

Wei Xue
Queen Mary University of London

Emmanouil Benetos
Queen Mary University of London
Machine Listening · Audio Signal Processing · Music Information Retrieval · Machine Learning

Kai Yu
Nanyang Technological University

Eng-Siong Chng
Nanyang Technological University
Speech and Language Processing · Digital Signal Processing · Pattern Recognition

Xie Chen
Shanghai Jiao Tong University, Shanghai Innovation Institute