🤖 AI Summary
Current multimodal large language models (MLLMs) excel at natural-image and document understanding, but visual reasoning over musical scores has not been systematically investigated. To address this gap, we introduce MusiXQA, the first multimodal benchmark dedicated to musical score understanding, featuring synthetically generated scores with structured annotations (e.g., notes, chords, clefs) and diverse visual question-answering tasks. Leveraging MusiXTeX, we generate high-fidelity synthetic scores and fine-tune the Phi-3 architecture to obtain Phi-3-MusiX, a model specialized for music notation understanding. Experiments show that state-of-the-art MLLMs, including GPT-series models, perform poorly on MusiXQA, whereas Phi-3-MusiX achieves substantial gains. This work establishes the first standardized evaluation framework and dedicated modeling approach for visual reasoning over musical symbols, bridging a critical gap in AI-driven music information processing and laying groundwork for future research.
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable visual reasoning abilities on natural images, text-rich documents, and graphic designs. However, their ability to interpret music sheets remains underexplored. To bridge this gap, we introduce MusiXQA, the first comprehensive dataset for evaluating and advancing MLLMs in music sheet understanding. MusiXQA features high-quality synthetic music sheets generated via MusiXTeX, with structured annotations covering note pitch and duration, chords, clefs, key/time signatures, and text, enabling diverse visual QA tasks. Through extensive evaluations, we reveal significant limitations of current state-of-the-art MLLMs in this domain. Beyond benchmarking, we develop Phi-3-MusiX, an MLLM fine-tuned on our dataset, which achieves significant performance gains over GPT-based methods. The proposed dataset and model establish a foundation for future advances in MLLMs for music sheet understanding. Code, data, and model will be released upon acceptance.
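Both paragraphs note that the synthetic scores are typeset with MusiXTeX. As a rough illustration of what such source looks like (a minimal sketch only; the authors' actual generation pipeline and annotation format are not described here), a short single-staff excerpt can be typeset like this:

```latex
% Minimal MusiXTeX document typesetting a short excerpt.
% Illustrative sketch, not the MusiXQA generation code.
\documentclass{article}
\usepackage{musixtex}
\begin{document}
\begin{music}
  \generalmeter{\meterfrac44} % 4/4 time signature
  \startextract               % begin a short excerpt
  \Notes \qa{cdef} \en        % four stem-up quarter notes: C D E F
  \bar
  \Notes \qa{g} \ha{h} \en    % quarter note G, half note A
  \endextract
\end{music}
\end{document}
```

Because every note, clef, and signature originates from commands like these, the same source that renders the image can be parsed into ground-truth annotations, which is what makes large-scale synthetic QA generation tractable.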