🤖 AI Summary
Music Information Retrieval (MIR) has long been fragmented across task-specific models, and the field has lacked a unified multimodal foundation model that generalizes across diverse downstream tasks.
Method: We propose MuFun, a foundation model for comprehensive music understanding that jointly models audio and lyrics through a multimodal architecture combining cross-modal attention with self-supervised learning. To evaluate such models rigorously, we introduce MuCUE (Music Comprehensive Understanding Evaluation), a comprehensive benchmark for multimodal music understanding.
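To make the fusion idea concrete, the sketch below shows one plausible cross-modal attention layer in which lyric token embeddings attend over audio frame embeddings. This is a minimal hypothetical illustration under stated assumptions, not the authors' actual architecture; the names (`CrossModalFusion`, `d_model`, the toy shapes) are all invented for the example.

```python
# Minimal sketch (NOT the MuFun code) of cross-modal attention:
# lyric tokens act as queries; audio frames provide keys/values.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, lyrics: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # lyrics: (batch, n_tokens, d_model); audio: (batch, n_frames, d_model)
        fused, _ = self.attn(query=lyrics, key=audio, value=audio)
        # Residual connection plus layer norm, as in a standard Transformer block.
        return self.norm(lyrics + fused)

# Toy usage: 4 lyric tokens attend over 100 audio frames.
fusion = CrossModalFusion()
lyrics = torch.randn(2, 4, 512)
audio = torch.randn(2, 100, 512)
out = fusion(lyrics, audio)  # shape: (2, 4, 512)
```

Stacking such layers, trained with a self-supervised objective such as masked reconstruction, is one common way to realize the kind of joint audio–lyrics modeling the summary describes.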
Contribution/Results: Trained on large-scale aligned audio–text data, MuFun achieves state-of-the-art results on MuCUE across music classification, tagging, and question answering, significantly outperforming existing audio large language models. Its strong cross-task generalization supports a shift in MIR from task-specific models toward unified, general-purpose foundation models.
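To make "cross-task" evaluation concrete, here is a minimal sketch of MuCUE-style scoring: one model is run over several task splits, and per-task plus macro-averaged accuracy is reported. The task names and the `predict` callable are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical multi-task scoring loop, not the official MuCUE harness.
from statistics import mean

def evaluate(predict, tasks):
    """tasks: {task_name: [(model_input, expected_answer), ...]}."""
    scores = {}
    for name, examples in tasks.items():
        correct = sum(predict(x) == y for x, y in examples)
        scores[name] = correct / len(examples)
    # Macro average over tasks is one simple way to summarize generalization.
    scores["macro_avg"] = mean(scores.values())
    return scores

# Toy usage with a trivial constant "model".
tasks = {
    "genre_classification": [("clip_01", "rock"), ("clip_02", "jazz")],
    "music_tagging": [("clip_03", "upbeat")],
}
print(evaluate(lambda x: "rock", tasks))
```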
📝 Abstract
The field of Music Information Retrieval (MIR) is fragmented, with specialized models excelling at isolated tasks. In this work, we challenge this paradigm by introducing a unified foundation model named MuFun for holistic music understanding. Our model features a novel architecture that jointly processes instrumental and lyrical content, and is trained on a large-scale dataset covering diverse tasks such as genre classification, music tagging, and question answering. To facilitate robust evaluation, we also propose a new benchmark for multi-faceted music understanding called MuCUE (Music Comprehensive Understanding Evaluation). Experiments show our model significantly outperforms existing audio large language models across the MuCUE tasks, demonstrating its state-of-the-art effectiveness and generalization ability.