Layer-wise Investigation of Large-Scale Self-Supervised Music Representation Models

📅 2025-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates the semantic properties and task adaptability of hidden-layer representations in self-supervised learning (SSL) pre-trained music models, specifically MusicFM and MuQ. We propose the first layer-wise functional disentanglement analysis framework for music SSL models, integrating layer-freezing fine-tuning, cross-layer feature similarity analysis, and task-oriented layer importance scoring. Systematic evaluation is conducted across six downstream tasks on benchmarks including MusicCaps and NSynth. Results reveal a hierarchical semantic organization: mid-level layers predominantly encode acoustic attributes (e.g., rhythm, timbre), whereas higher layers specialize in abstract semantic understanding, a finding that challenges conventional black-box evaluation paradigms. Based on this, we derive an optimal layer selection strategy: MuQ’s 12th layer serves as the most robust general-purpose representation, while MusicFM’s 8th layer achieves peak performance on fine-grained timbre tasks. Overall, our layer-aware adaptation yields an average 2.3% improvement in downstream task accuracy.
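The summary does not say which measure is used for the cross-layer feature similarity analysis. As a minimal sketch, assuming linear centered kernel alignment (CKA) as a stand-in, the comparison could look like the following. The function names and the synthetic features are hypothetical; real inputs would be pooled hidden states taken from each layer of MusicFM or MuQ.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two (n_samples, dim) feature matrices.

    CKA is an assumed choice here, not necessarily the paper's measure.
    """
    # Center each feature dimension across samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))


def layer_similarity_matrix(layer_feats):
    """Pairwise linear CKA over a list of per-layer feature matrices."""
    n = len(layer_feats)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = linear_cka(layer_feats[i], layer_feats[j])
    return sim


# Synthetic stand-in for hidden states: each "layer" drifts from the previous one.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((64, 16))]
for _ in range(3):
    feats.append(feats[-1] + 0.5 * rng.standard_normal((64, 16)))

sim = layer_similarity_matrix(feats)
```

Each entry `sim[i, j]` lies between 0 and 1, with 1 on the diagonal. Blocks of high similarity between neighboring layers would suggest groups of layers that encode related information.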

📝 Abstract

Recently, pre-trained models for music information retrieval based on self-supervised learning (SSL) have become popular, showing success in various downstream tasks. However, there is limited research on the specific meaning of the encoded information and its applicability. Exploring these aspects can help us better understand the models' capabilities and limitations, leading to more effective use in downstream tasks. In this study, we analyze the advanced music representation model MusicFM and the newly emerged SSL model MuQ. We focus on three main aspects: (i) validating the advantages of SSL models across multiple downstream tasks, (ii) exploring the specialization of layer-wise information for different tasks, and (iii) comparing performance differences when selecting specific layers. Through this analysis, we reveal insights into the structure and potential applications of SSL models in music information retrieval.
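Aspects (ii) and (iii) amount to asking which layer's features best support a given task. The abstract does not describe the probing setup, so the following is only a sketch under stated assumptions: a closed-form ridge classifier is trained on frozen features from each layer, and the layers are ranked by held-out accuracy. The helper name, the synthetic features, and the choice of probe are all hypothetical.

```python
import numpy as np


def ridge_probe_accuracy(x_train, y_train, x_test, y_test, reg=1e-2):
    """Fit a ridge-regression probe on frozen features; return test accuracy.

    Labels are binary (0/1). The probe type is an assumption for illustration.
    """
    def add_bias(x):
        return np.hstack([x, np.ones((x.shape[0], 1))])

    a, b = add_bias(x_train), add_bias(x_test)
    targets = 2.0 * y_train - 1.0  # map {0, 1} to {-1, +1}
    w = np.linalg.solve(a.T @ a + reg * np.eye(a.shape[1]), a.T @ targets)
    pred = (b @ w > 0).astype(int)
    return float((pred == y_test).mean())


# Synthetic stand-in: 4 "layers" of noise, with the label signal injected in layer 2.
rng = np.random.default_rng(0)
n_samples, dim, n_layers = 200, 8, 4
labels = rng.integers(0, 2, size=n_samples)
layers = [rng.standard_normal((n_samples, dim)) for _ in range(n_layers)]
layers[2][:, 0] += 3.0 * (2.0 * labels - 1.0)

split = 150  # first 150 samples for training, the rest for evaluation
scores = [
    ridge_probe_accuracy(f[:split], labels[:split], f[split:], labels[split:])
    for f in layers
]
best_layer = int(np.argmax(scores))
```

With real models, `layers` would hold time-pooled hidden states per layer, and the same loop would be repeated per downstream task to see whether the best layer changes from task to task.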

Problem

Research questions and friction points this paper is trying to address.

Analyzing what the information encoded in SSL music models means

Exploring layer-wise specialization for different music tasks

Comparing performance impacts of specific layer selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing MusicFM and MuQ SSL models layer-wise

Validating SSL models in multiple downstream tasks

Comparing performance differences across specific layers

🔎 Similar Papers

Exploring Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations