🤖 AI Summary
This work proposes the first unified multimodal large language model capable of jointly modeling global semantic and temporal dynamic characteristics of music. Built upon an LLaVA-style encoder-decoder architecture augmented with a mixture-of-experts audio encoder, the model employs a three-stage progressive training strategy—comprising pretraining, supervised fine-tuning, and reinforcement learning—to achieve cross-modal music-language understanding. Its key contributions include the first single framework that simultaneously handles both global and temporal dimensions of music comprehension, along with the introduction of MusicBench, the largest music multiple-choice question-answering benchmark to date. The model achieves state-of-the-art performance across all evaluated benchmarks, including MuchoMusic (79.1%), MusicBench-Temporal (79.3%), and MusicBench-Global (81.3%).
📝 Abstract
In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA effectively unifies both time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, effectively pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both temporal and non-temporal capability of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes new SoTA in the music domain, achieving 79.1% accuracy on MuchoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.