🤖 AI Summary
Current multimodal large language models (MLLMs) excel at natural-image and document understanding, but visual reasoning over musical scores has not been systematically investigated. To address this gap, we introduce MusiXQA, the first multimodal benchmark dedicated to musical score understanding, featuring synthetically generated scores with structured annotations (e.g., notes, chords, clefs) and diverse visual question-answering tasks. Leveraging MusiXTeX, we generate high-fidelity synthetic scores and fine-tune the Phi-3 architecture to obtain Phi-3-MusiX, a model specialized for music notation understanding. Experiments show that state-of-the-art MLLMs, including GPT-series models, perform poorly on MusiXQA, whereas Phi-3-MusiX achieves substantial gains. This work establishes the first standardized evaluation framework and dedicated modeling approach for visual reasoning over musical symbols, bridging a critical gap in AI-driven music information processing and laying groundwork for future research.
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable visual reasoning abilities on natural images, text-rich documents, and graphic designs. However, their ability to interpret music sheets remains underexplored. To bridge this gap, we introduce MusiXQA, the first comprehensive dataset for evaluating and advancing MLLMs in music sheet understanding. MusiXQA features high-quality synthetic music sheets generated via MusiXTeX, with structured annotations covering note pitch and duration, chords, clefs, key/time signatures, and text, enabling diverse visual QA tasks. Through extensive evaluations, we reveal significant limitations of current state-of-the-art MLLMs in this domain. Beyond benchmarking, we develop Phi-3-MusiX, an MLLM fine-tuned on our dataset, which achieves significant performance gains over GPT-based methods. The proposed dataset and model establish a foundation for future advances in MLLMs for music sheet understanding. Code, data, and model will be released upon acceptance.
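Both paragraphs note that the synthetic scores are typeset with MusiXTeX. As a rough illustration of what such source looks like (a minimal sketch only; the authors' actual generation pipeline and annotation format are not described here), a short single-staff excerpt can be typeset like this:

```latex
% Minimal MusiXTeX document typesetting a short excerpt.
% Illustrative sketch, not the MusiXQA generation code.
\documentclass{article}
\usepackage{musixtex}
\begin{document}
\begin{music}
  \generalmeter{\meterfrac44} % 4/4 time signature
  \startextract               % begin a short excerpt
  \Notes \qa{cdef} \en        % four stem-up quarter notes: C D E F
  \bar
  \Notes \qa{g} \ha{h} \en    % quarter note G, half note A
  \endextract
\end{music}
\end{document}
```

Because every note, clef, and signature originates from commands like these, the same source that renders the image can be parsed into ground-truth annotations, which is what makes large-scale synthetic QA generation tractable.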