Music Foundation Model as Generic Booster for Music Downstream Tasks

📅 2024-11-02
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses data scarcity and limited generalization in music downstream tasks by proposing SoniDo, a music foundation model whose hierarchical intermediate representations serve as generic task boosters. Methodologically, the parameters of the pre-trained music foundation model (MFM) are kept frozen, and its multi-level intermediate features are extracted and fed to downstream task models as auxiliary inputs, benefiting both understanding tasks (e.g., music tagging, transcription) and generative tasks (e.g., source separation, mixing) without fine-tuning the foundation model itself. The key idea is that constraining information granularity through hierarchical features makes a single frozen model a plug-in enhancer across heterogeneous tasks. Experiments on these representative tasks show that the extracted features consistently improve task-specific models, with gains that are especially pronounced in low-data regimes.

📝 Abstract
We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.
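The core recipe described above, a frozen foundation model whose intermediate activations are concatenated into a downstream model's input, can be sketched in a few lines. The toy `FrozenMFM` below is a hypothetical stand-in (random untrained layers, NumPy only) and does not reproduce SoniDo's actual architecture; it only illustrates the data flow of hierarchical features acting as a booster.

```python
import numpy as np

rng = np.random.default_rng(0)

class FrozenMFM:
    """Toy stand-in for a frozen music foundation model.

    Hypothetical: a stack of random, untrained linear layers whose
    weights are never updated, mimicking a frozen pre-trained MFM.
    """

    def __init__(self, dim=16, n_layers=3):
        self.weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim)
                        for _ in range(n_layers)]

    def extract_features(self, x):
        """Return the intermediate activation of every layer (hierarchical features)."""
        feats = []
        h = x
        for w in self.weights:
            h = np.tanh(h @ w)  # frozen forward pass; no gradient/update
            feats.append(h)
        return feats

def boosted_input(mfm, x):
    """Concatenate raw input with multi-level MFM features.

    A downstream task model (tagging, transcription, separation, ...)
    would be trained on this enhanced representation.
    """
    feats = mfm.extract_features(x)
    return np.concatenate([x] + feats, axis=-1)

mfm = FrozenMFM()
x = rng.standard_normal((4, 16))   # batch of 4 toy "music" frames
boosted = boosted_input(mfm, x)
print(boosted.shape)               # (4, 64): raw 16 dims + 3 layers x 16 dims
```

In practice the frozen model's layers capture different information granularities, and the paper's point is that exposing all of them, rather than only the final embedding, is what makes the features useful across both understanding and generative tasks.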
Problem

Research questions and friction points this paper is trying to address.

Enhancing music downstream tasks using foundation model features
Extracting hierarchical features for improved music task performance
Addressing data scarcity in music processing with foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a single foundation model across multiple tasks
Extracts hierarchical features via SoniDo
Enhances performance in data-scarce scenarios
👥 Authors

Wei-Hsiang Liao (SonyAI, Tokyo, Japan)
Yuhta Takida (Sony AI) — Machine Learning · Generative Modeling · Acoustic Signal Processing
Yukara Ikemiya (Sony) — Signal Processing
Zhi-Wei Zhong (Sony Group Corporation, Tokyo, Japan)
Chieh-Hsin Lai (SonyAI, Tokyo, Japan)
Giorgio Fabbro (Sony Europe B.V., Stuttgart, Germany)
Kazuki Shimada (Sony) — Signal Processing · Speech Recognition
Keisuke Toyama (Sony Group Corporation) — Audio Signal Processing · Music Information Retrieval · Natural Language Processing
K. Cheuk (SonyAI, Tokyo, Japan)
Marco Martinez (SonyAI, Tokyo, Japan)
Shusuke Takahashi (Sony Group Corporation) — Audio Signal Processing
S. Uhlich (Sony Europe B.V., Stuttgart, Germany)
Taketo Akama (Sony CSL, Tokyo, Japan)
Woosung Choi (SonyAI) — Machine Learning · Signal Processing · Source Separation
Yuichiro Koyama (Sony Group Corporation, Tokyo, Japan)
Yuki Mitsufuji (Distinguished Engineer, Sony) — Machine Learning · Audio · Source Separation · Music Technology · Spatial Audio