MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio

📅 2026-01-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing general-purpose multimodal large language models struggle to establish fine-grained associations between sheet music images and performance audio at both the perceptual and symbolic levels, limiting their capacity for interactive musical reasoning. This work proposes the first music-centric multimodal agent that unifies multimodal inputs into a structured symbolic representation by integrating optical music recognition (OMR) and automatic music transcription (AMT), thereby enabling cross-modal alignment and multi-step symbolic reasoning. The study introduces, for the first time, structured symbolic representations for aligning sheet music with audio, and presents MuseBench, the first multimodal benchmark for music understanding, which covers music theory, score reading, and performance analysis. Experiments demonstrate that the proposed approach significantly outperforms existing general-purpose models on MuseBench, validating the efficacy of structured multimodal grounding for interactive music understanding.

๐Ÿ“ Abstract
Despite recent advances in multimodal large language models (MLLMs), their ability to understand and interact with music remains limited. Music understanding requires grounded reasoning over symbolic scores and expressive performance audio, which general-purpose MLLMs often fail to handle due to insufficient perceptual grounding. We introduce MuseAgent, a music-centric multimodal agent that augments language models with structured symbolic representations derived from sheet music images and performance audio. By integrating optical music recognition and automatic music transcription modules, MuseAgent enables multi-step reasoning and interaction over fine-grained musical content. To systematically evaluate music understanding capabilities, we further propose MuseBench, a benchmark covering music theory reasoning, score interpretation, and performance-level analysis across text, image, and audio modalities. Experiments show that existing MLLMs perform poorly on these tasks, while MuseAgent achieves substantial improvements, highlighting the importance of structured multimodal grounding for interactive music understanding.
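The abstract describes unifying OMR output (score notes) and AMT output (transcribed performance notes) into one symbolic representation so the two modalities can be aligned note-by-note. The paper does not specify the alignment algorithm; below is a minimal, hypothetical sketch of the general idea, using an invented `Note` record and a greedy pitch-and-onset matcher. All names and the tolerance parameter are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Note:
    """Shared symbolic unit for both modalities (hypothetical schema).

    pitch:    MIDI pitch number (e.g. 60 = middle C)
    onset:    position in a common time base (e.g. beats, after tempo mapping)
    duration: length in the same time base
    """
    pitch: int
    onset: float
    duration: float

def align(score_notes, audio_notes, tol=0.5):
    """Greedily match each OMR score note to the nearest unmatched AMT note
    with the same pitch, if its onset lies within `tol` of the score onset.
    Returns a list of (score_note, audio_note) pairs."""
    matches, used = [], set()
    for s in score_notes:
        best, best_dist = None, tol
        for j, a in enumerate(audio_notes):
            if j in used or a.pitch != s.pitch:
                continue
            dist = abs(a.onset - s.onset)
            if dist <= best_dist:
                best, best_dist = j, dist
        if best is not None:
            used.add(best)
            matches.append((s, audio_notes[best]))
    return matches

# Example: a two-note score against a slightly rushed performance.
score = [Note(60, 0.0, 1.0), Note(64, 1.0, 1.0)]
audio = [Note(60, 0.1, 0.9), Note(64, 1.2, 1.0)]
pairs = align(score, audio)
```

A real system would need tempo alignment (e.g. dynamic time warping) before onsets are comparable, but the greedy matcher shows how a shared symbolic schema makes cross-modal grounding a simple search problem.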
Problem

Research questions and friction points this paper is trying to address.

multimodal music understanding
symbolic music scores
performance audio
perceptual grounding
music reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal grounding
symbolic music representation
optical music recognition
music performance analysis
interactive music understanding
🔎 Similar Papers
2024-03-06 · IEEE Transactions on Audio, Speech, and Language Processing · Citations: 1