🤖 AI Summary
This work addresses data scarcity and limited generalization in music downstream tasks by introducing SoniDo, a music foundation model (MFM) designed to extract hierarchical features from music samples. Methodologically, it takes intermediate features at multiple levels of the pre-trained MFM and uses them to enhance the training of task-specific models, covering both understanding tasks (e.g., music tagging, transcription) and generative tasks (e.g., source separation, mixing). Its key contribution is showing that the hierarchical intermediate representations of a single foundation model can serve as a reusable booster across tasks, with the choice of feature level constraining the information granularity passed to each task. Experiments on these four representative tasks indicate that the extracted features improve existing task-specific models and also help tasks constrained by data scarcity.
📝 Abstract
We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.
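The idea of boosting a downstream model with hierarchical intermediate features can be sketched in a few lines. The following is a toy illustration only, not SoniDo's architecture or API: `FrozenToyEncoder`, `enhance`, the three levels, their time strides, and the fixed random projections standing in for pre-trained weights are all hypothetical, chosen just to show features of different granularity being aligned to a task's frame rate and concatenated with the task's own features.

```python
import numpy as np


class FrozenToyEncoder:
    """Hypothetical stand-in for a pre-trained model exposing three feature levels."""

    def __init__(self, in_dim=8, dims=(16, 12, 8), strides=(8, 4, 2), seed=0):
        rng = np.random.default_rng(seed)
        self.strides = strides
        # Fixed random projections play the role of pre-trained weights (assumption).
        self.weights = [rng.standard_normal((in_dim, d)) for d in dims]

    def extract(self, frames):
        """frames: (T, in_dim). Returns one (T // stride, dim) array per level."""
        feats = []
        for stride, w in zip(self.strides, self.weights):
            t = (frames.shape[0] // stride) * stride
            # Average-pool over time: a larger stride gives a coarser level.
            pooled = frames[:t].reshape(-1, stride, frames.shape[1]).mean(axis=1)
            feats.append(np.tanh(pooled @ w))
        return feats


def enhance(task_feats, level_feats):
    """Repeat each level up to the task frame rate, then concatenate on channels."""
    T = task_feats.shape[0]
    aligned = []
    for f in level_feats:
        # Nearest-neighbour index from task frames to this level's frames.
        idx = np.minimum(np.arange(T) * f.shape[0] // T, f.shape[0] - 1)
        aligned.append(f[idx])
    return np.concatenate([task_feats, *aligned], axis=1)


rng = np.random.default_rng(1)
frames = rng.standard_normal((64, 8))      # toy input frames
task_feats = rng.standard_normal((64, 4))  # the downstream model's own features

encoder = FrozenToyEncoder()
levels = encoder.extract(frames)
enhanced = enhance(task_feats, levels)
print([f.shape for f in levels], enhanced.shape)
```

In this sketch a downstream model would train on `enhanced` instead of `task_feats`. Passing only a subset of `levels` is one simple way to control how coarse or fine the injected information is.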