WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of evaluation benchmarks for multimodal large language models' (MLLMs) symbolic music understanding and reasoning capabilities. We introduce WildScore, the first multimodal benchmark grounded in real-world musical scores and professional music-theoretic questions. Methodologically, we propose a systematic musicological taxonomy that formalizes complex reasoning tasks as multiple-choice questions; construct a dataset from authentic score images and user-generated queries to support joint visual-symbolic reasoning; and integrate a musicological ontology for fine-grained, domain-aware assessment. Experimental evaluation of state-of-the-art MLLMs reveals, for the first time, their concrete limitations in expert-level tasks such as tonal analysis and formal structure identification. WildScore fills a critical gap in symbolic music multimodal reasoning evaluation and provides an empirically grounded, extensible assessment framework to guide the design and optimization of music AI models.

📝 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs' capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs' symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' symbolic music reasoning abilities
Assessing interpretation of real-world music scores
Testing complex musicological query answering capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created first in-the-wild multimodal symbolic music benchmark
Used real musical compositions with authentic user questions
Framed music reasoning as multiple-choice question answering
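The multiple-choice framing makes evaluation controlled and scalable: scoring reduces to comparing a model's chosen option against the gold answer, aggregated per taxonomy category. Below is a minimal sketch of such taxonomy-aware accuracy scoring; the field names and category labels are hypothetical illustrations, not WildScore's actual schema.

```python
from collections import defaultdict

def score_mcq(predictions, dataset):
    """Compute overall and per-category accuracy for multiple-choice answers.

    predictions: dict mapping question id -> predicted option letter
    dataset: list of dicts with 'id', 'answer', and 'category' keys
    (hypothetical schema for illustration)
    """
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    correct = 0
    for item in dataset:
        hit = predictions.get(item["id"]) == item["answer"]
        per_cat[item["category"]][0] += int(hit)
        per_cat[item["category"]][1] += 1
        correct += int(hit)
    overall = correct / len(dataset) if dataset else 0.0
    by_category = {c: h / n for c, (h, n) in per_cat.items()}
    return overall, by_category

# Illustrative run on made-up items:
dataset = [
    {"id": "q1", "answer": "B", "category": "tonal analysis"},
    {"id": "q2", "answer": "D", "category": "tonal analysis"},
    {"id": "q3", "answer": "A", "category": "form"},
]
predictions = {"q1": "B", "q2": "C", "q3": "A"}
overall, by_cat = score_mcq(predictions, dataset)
print(overall)         # 2 of 3 correct
print(by_cat["form"])  # 1.0
```

Per-category breakdowns like `by_cat` are what surface the fine-grained weaknesses (e.g. tonal analysis vs. form identification) that the paper reports.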