MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models

📅 2025-06-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) excel at natural-image and document understanding, but their visual reasoning over musical scores has not been systematically investigated. To address this gap, we introduce MusiXQA—the first multimodal benchmark dedicated to musical score understanding—featuring synthetically generated scores with structured annotations (e.g., notes, chords, clefs) and diverse visual question-answering tasks. Leveraging MusiXTeX, we generate high-fidelity synthetic score data and fine-tune the Phi-3 architecture to develop Phi-3-MusiX, a specialized model for music notation understanding. Experimental results show that state-of-the-art MLLMs—including GPT-series models—perform poorly on MusiXQA, whereas Phi-3-MusiX achieves substantial gains. This work establishes the first standardized evaluation framework and dedicated modeling approach for visual reasoning over musical symbols, bridging a critical gap in AI-driven music information processing and laying groundwork for future research.

📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable visual reasoning abilities in natural images, text-rich documents, and graphic designs. However, their ability to interpret music sheets remains underexplored. To bridge this gap, we introduce MusiXQA, the first comprehensive dataset for evaluating and advancing MLLMs in music sheet understanding. MusiXQA features high-quality synthetic music sheets generated via MusiXTeX, with structured annotations covering note pitch and duration, chords, clefs, key/time signatures, and text, enabling diverse visual QA tasks. Through extensive evaluations, we reveal significant limitations of current state-of-the-art MLLMs in this domain. Beyond benchmarking, we develop Phi-3-MusiX, an MLLM fine-tuned on our dataset that achieves significant performance gains over GPT-based methods. The proposed dataset and model establish a foundation for future advances in MLLMs for music sheet understanding. Code, data, and model will be released upon acceptance.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' ability to interpret music sheets
Addressing limitations in current MLLMs for music understanding
Advancing multimodal models for structured music sheet analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MusiXQA dataset for music sheet understanding
Develops Phi-3-MusiX model via fine-tuning
Uses MusiXTeX for synthetic music sheet generation
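MusiXTeX, the typesetting system the paper uses for score generation, renders notation from plain LaTeX source, which is what makes large-scale synthetic generation with known ground-truth annotations possible. As a rough illustration only (a minimal generic MusiXTeX document, not the authors' actual generation pipeline), a single one-bar staff in 4/4 can be typeset like this:

```latex
\documentclass{article}
\usepackage{musixtex}   % MusiXTeX notation package (CTAN: musixtex)
\begin{document}
\begin{music}
\instrumentnumber{1}            % one instrument
\setstaffs1{1}                  % with a single staff
\generalmeter{\meterfrac44}     % 4/4 time signature
\startextract                   % begin an unjustified excerpt
\Notes\qa{cdef}\en              % four quarter notes: c d e f
\endextract
\end{music}
\end{document}
```

Because every note, clef, and signature is specified symbolically in the source, the same source that renders the image also yields exact structured annotations for QA pairs, which is the advantage of synthetic generation over scanned sheet music.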
Jian Chen (University at Buffalo)
Wenye Ma (Mohamed bin Zayed University of Artificial Intelligence)
Penghang Liu (J.P.Morgan AI Research)
Wei Wang (University at Buffalo)
Tengwei Song (Beihang University)
Ming Li (University of Maryland)
Chenguang Wang (Duke University)
Ruiyi Zhang (Duke University)
Changyou Chen (University at Buffalo)