Advancing the Foundation Model for Music Understanding

📅 2025-08-01
🤖 AI Summary
Problem: Music Information Retrieval (MIR) has long suffered from model fragmentation and lacks a unified multimodal foundation model that generalizes across diverse downstream tasks. Method: We propose MuFun, a foundation model for comprehensive music understanding that jointly models audio and lyrics through a multimodal architecture integrating cross-modal attention and self-supervised learning. To evaluate such models rigorously, we introduce MuCUE, a new comprehensive benchmark for multimodal music understanding. Contribution/Results: Trained on large-scale aligned audio–text data, MuFun achieves state-of-the-art performance on music classification, tagging, and question answering on MuCUE, significantly outperforming existing audio-centric large language models. It demonstrates strong cross-task generalization and supports a shift in MIR from task-specific models toward unified, general-purpose foundation models.

📝 Abstract
The field of Music Information Retrieval (MIR) is fragmented, with specialized models excelling at isolated tasks. In this work, we challenge this paradigm by introducing a unified foundation model named MuFun for holistic music understanding. Our model features a novel architecture that jointly processes instrumental and lyrical content, and is trained on a large-scale dataset covering diverse tasks such as genre classification, music tagging, and question answering. To facilitate robust evaluation, we also propose a new benchmark for multi-faceted music understanding called MuCUE (Music Comprehensive Understanding Evaluation). Experiments show our model significantly outperforms existing audio large language models across the MuCUE tasks, demonstrating its state-of-the-art effectiveness and generalization ability.
Problem

Research questions and friction points this paper is trying to address.

MIR is fragmented into specialized models that each handle an isolated task
Lack of unified foundation model for holistic music understanding
Need for robust evaluation benchmark in music understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified foundation model for holistic music understanding
Joint processing of instrumental and lyrical content
New benchmark (MuCUE) for evaluating multi-faceted music understanding
Authors
Yi Jiang
Zhejiang University
Wei Wang
NetEase Cloud Music
Xianwen Guo
NetEase Cloud Music
Huiyun Liu
NetEase Cloud Music
Hanrui Wang
MIT
Deep Learning, Computer Architecture
Youri Xu
NetEase Cloud Music
Haoqi Gu
NetEase Cloud Music
Zhongqian Xie
NetEase Cloud Music
Chuanjiang Luo
Google Inc.
Geometric computing, spectral shape analysis