The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current audio multimodal large language models (MLLMs) lack systematic evaluation frameworks tailored to music perception and auditory relational reasoning, leading to superficial assessments that obscure structural deficiencies. To address this, we introduce MUSE, an open-source, music-cognition-focused benchmark comprising ten tasks that span core musical dimensions (pitch, rhythm, harmony, and structured relational reasoning) and is accompanied by human expert baselines. We evaluate four state-of-the-art models (Gemini Pro, Gemini Flash, Qwen2.5-Omni, and Audio-Flamingo 3), revealing consistent and substantial performance gaps relative to human experts, with some models performing near chance level; notably, chain-of-thought prompting proves unstable and sometimes detrimental. MUSE uncovers fundamental limitations in audio semantic representation and hierarchical relational modeling within existing MLLMs. By providing a standardized, empirically grounded evaluation suite, MUSE establishes a rigorous foundation for music AI capability assessment and targeted model development.
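The paper's scoring pipeline is not reproduced here; the sketch below only illustrates how per-task accuracy could be aggregated and compared against chance level and a human baseline of the kind MUSE reports. The task names, chance levels, and record format are assumptions for illustration, not details taken from the benchmark.

```python
from collections import defaultdict

# Hypothetical result records: (task, model_answer, gold_answer).
# Task names and chance levels are illustrative, not taken from MUSE.
CHANCE_LEVEL = {"pitch_comparison": 0.5, "rhythm_match": 0.25, "chord_quality": 0.25}

def accuracy_by_task(records):
    """Aggregate exact-match accuracy per task."""
    correct, total = defaultdict(int), defaultdict(int)
    for task, pred, gold in records:
        total[task] += 1
        correct[task] += int(pred.strip().lower() == gold.strip().lower())
    return {t: correct[t] / total[t] for t in total}

def report(model_acc, human_acc):
    """Print each task's model accuracy next to its chance level and the human baseline."""
    for task, acc in sorted(model_acc.items()):
        chance = CHANCE_LEVEL.get(task, 0.0)
        human = human_acc.get(task, float("nan"))
        flag = "<- near chance" if acc <= chance + 0.05 else ""
        print(f"{task:18s} model={acc:.2f}  chance={chance:.2f}  human={human:.2f}  {flag}")
```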

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated capabilities in audio understanding, but current evaluations may obscure fundamental weaknesses in relational reasoning. We introduce the Music Understanding and Structural Evaluation (MUSE) Benchmark, an open-source resource with 10 tasks designed to probe fundamental music perception skills. We evaluate four SOTA models (Gemini Pro and Flash, Qwen2.5-Omni, and Audio-Flamingo 3) against a large human baseline (N=200). Our results reveal a wide variance in SOTA capabilities and a persistent gap with human experts. While Gemini Pro succeeds on basic perception, Qwen and Audio Flamingo 3 perform at or near chance, exposing severe perceptual deficits. Furthermore, we find Chain-of-Thought (CoT) prompting provides inconsistent, often detrimental results. Our work provides a critical tool for evaluating invariant musical representations and driving development of more robust AI systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating music perception and relational reasoning in audio LLMs
Identifying performance gaps between AI models and human experts
Assessing inconsistent effects of Chain-of-Thought prompting
Innovation

Methods, ideas, or system contributions that make the work stand out.

MUSE Benchmark: 10 open-source tasks probing pitch, rhythm, harmony, and relational reasoning
Evaluates four SOTA models against a human baseline (N=200)
Reveals large model-human performance gaps and inconsistent, sometimes detrimental CoT prompting effects (see the sketch after this list)
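The abstract reports that Chain-of-Thought prompting is inconsistent and often detrimental; below is a minimal sketch of how such a direct-vs-CoT ablation could be tallied per task. The task names and accuracy numbers are hypothetical placeholders, not results from the paper.

```python
def cot_ablation(direct_acc, cot_acc):
    """Per-task accuracy delta of CoT over direct prompting; negative deltas mark tasks where CoT hurts."""
    deltas = {t: round(cot_acc[t] - direct_acc[t], 3) for t in direct_acc if t in cot_acc}
    hurt = sorted(t for t, d in deltas.items() if d < 0)
    return deltas, hurt

# Illustrative numbers only, not results from the paper.
direct = {"pitch_comparison": 0.72, "rhythm_match": 0.41, "chord_quality": 0.33}
cot    = {"pitch_comparison": 0.68, "rhythm_match": 0.45, "chord_quality": 0.29}
deltas, hurt = cot_ablation(direct, cot)
print(deltas)
print("CoT hurts on:", hurt)
```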