CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following

📅 2025-06-14
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing Music Information Retrieval (MIR) evaluation benchmarks suffer from task oversimplification and narrow paradigms, failing to capture the multifaceted nature of real-world music understanding. Method: We introduce CMI-Bench, the first instruction-following benchmark for music understanding, covering 13 core MIR tasks (e.g., emotion regression, melody extraction, beat tracking), with all annotations uniformly reformulated into structured instruction-following formats. Evaluation uses standard metrics consistent with supervised state-of-the-art models, ensuring direct cross-paradigm comparability. A unified evaluation toolkit supports open-source audio-text LLMs (e.g., LTU, Qwen-Audio), with standardized protocols and automated answer parsing. Results: Current audio-text LLMs underperform supervised models on most tasks and exhibit systematic cultural, chronological, and gender biases. CMI-Bench is the first open-source, reproducible, multi-task benchmark for instruction-based music evaluation.
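To make the reformulation concrete, below is a minimal sketch of how a traditional beat-tracking annotation might be recast as an (instruction, answer) pair. The prompt wording, field names, and answer format are illustrative assumptions, not CMI-Bench's actual schema.

```python
# Hypothetical sketch of recasting a traditional MIR annotation as an
# instruction-following example. The prompt wording, field names, and
# answer format are illustrative assumptions, not CMI-Bench's schema.

def beat_annotation_to_instruction(audio_path: str, beat_times: list[float]) -> dict:
    """Turn a beat-tracking annotation into an (instruction, answer) pair."""
    instruction = (
        "Listen to the audio and list the timestamp of every beat "
        "in seconds, separated by commas."
    )
    # Serialize the ground-truth beat times as the expected text answer.
    answer = ", ".join(f"{t:.2f}" for t in beat_times)
    return {"audio": audio_path, "instruction": instruction, "answer": answer}

example = beat_annotation_to_instruction("track_001.wav", [0.52, 1.04, 1.56, 2.08])
print(example["answer"])  # 0.52, 1.04, 1.56, 2.08
```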

📝 Abstract
Recent advances in audio-text large language models (LLMs) have opened new possibilities for music understanding and generation. However, existing benchmarks are limited in scope, often relying on simplified tasks or multiple-choice evaluations that fail to reflect the complexity of real-world music analysis. We reinterpret a broad range of traditional music information retrieval (MIR) annotations as instruction-following formats and introduce CMI-Bench, a comprehensive music instruction-following benchmark designed to evaluate audio-text LLMs on a diverse set of MIR tasks. These include genre classification, emotion regression, emotion tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking, reflecting core challenges in MIR research. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models, ensuring direct comparability with supervised approaches. We provide an evaluation toolkit supporting all open-source audio-text LLMs, including LTU, Qwen-Audio, SALMONN, and MusiLingo. Experimental results reveal significant performance gaps between LLMs and supervised models, along with cultural, chronological, and gender biases, highlighting the potential and limitations of current models in addressing MIR tasks. CMI-Bench establishes a unified foundation for evaluating music instruction following, driving progress in music-aware LLMs.
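As an illustration of the cross-paradigm scoring the abstract describes, the sketch below computes the standard beat-tracking F-measure with the mir_eval library, the common reference implementation for these MIR metrics. Whether CMI-Bench calls mir_eval internally is an assumption, and the beat times are toy values.

```python
# Minimal sketch of metric computation consistent with supervised MIR
# evaluation, using mir_eval (the standard reference implementation).
# Whether CMI-Bench uses mir_eval internally is an assumption; the beat
# times below are toy values, not real annotations.
import numpy as np
import mir_eval

reference_beats = np.array([0.52, 1.04, 1.56, 2.08, 2.60])  # ground truth (s)
estimated_beats = np.array([0.50, 1.05, 1.58, 2.10, 2.63])  # model output (s)

# Standard practice trims the first 5 s with mir_eval.beat.trim_beats;
# we skip that here so this toy example keeps its beats.
f_measure = mir_eval.beat.f_measure(reference_beats, estimated_beats)
print(f"Beat-tracking F-measure: {f_measure:.3f}")
```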
Problem

Research questions and friction points this paper is trying to address.

Evaluating music instruction following in audio-text LLMs
Addressing limitations of existing music benchmarks
Assessing diverse music information retrieval tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinterprets MIR annotations as instruction-following formats
Introduces CMI-Bench for diverse MIR task evaluation
Provides standardized metrics and an evaluation toolkit (see the parsing sketch below)
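Because the evaluated models answer in free text, such a toolkit must first parse responses into structured predictions before computing metrics. The helper below is a hypothetical parser for the beat-tracking case; the regex and assumed answer format are illustrative, not CMI-Bench's actual tooling.

```python
# Hypothetical sketch of the "automated tooling" step: parsing an audio-text
# LLM's free-form answer into numeric predictions before metric computation.
# The regex and answer format are assumptions, not CMI-Bench's actual parser.
import re
import numpy as np

def parse_timestamps(llm_answer: str) -> np.ndarray:
    """Extract second-valued timestamps from a free-text model response."""
    matches = re.findall(r"\d+(?:\.\d+)?", llm_answer)
    return np.array(sorted(float(m) for m in matches))

response = "The beats occur at 0.50, 1.05, 1.58, 2.10 and 2.63 seconds."
print(parse_timestamps(response))  # [0.5  1.05 1.58 2.1  2.63]
```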