🤖 AI Summary
Prior work lacks a systematic evaluation of large language models (LLMs) and vision-language models (VLMs) on comprehensive musical score understanding, covering pitch, rhythm, harmony, and form. Method: We introduce MSU-Bench, the first large-scale, human-curated benchmark for musical score understanding, supporting two modalities (ABC notation and PDF sheet images) and comprising 1,800 generative question-answer pairs spanning four hierarchical levels of understanding. It supports both zero-shot and fine-tuned evaluation. Contribution/Results: Experiments across more than 15 state-of-the-art LLMs and VLMs reveal significant cross-modal performance gaps and fragile level-wise performance in music understanding. Fine-tuning substantially improves musical comprehension while preserving general-domain knowledge. MSU-Bench provides a rigorous evaluation benchmark and foundational resource for music AI research.
📝 Abstract
Understanding complete musical scores requires reasoning over symbolic structures such as pitch, rhythm, harmony, and form. Despite the rapid progress of Large Language Models (LLMs) and Vision-Language Models (VLMs) in natural language and multimodal tasks, their ability to comprehend musical notation remains underexplored. We introduce the Musical Score Understanding Benchmark (MSU-Bench), the first large-scale, human-curated benchmark for evaluating score-level musical understanding across both textual (ABC notation) and visual (PDF) modalities. MSU-Bench comprises 1,800 generative question-answer (QA) pairs drawn from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four progressive levels of comprehension: Onset Information, Notation & Note, Chord & Harmony, and Texture & Form. Through extensive zero-shot and fine-tuned evaluations of more than 15 state-of-the-art (SOTA) models, we reveal sharp modality gaps, fragile level-wise success rates, and the difficulty of sustaining multilevel correctness. Fine-tuning markedly improves performance in both modalities while preserving general knowledge, establishing MSU-Bench as a rigorous foundation for future research at the intersection of Artificial Intelligence (AI), musicology, and multimodal reasoning.