🤖 AI Summary
This work addresses the challenge of learning fine-grained, perceptually consistent instrument-level music similarity representations without requiring clean, isolated instrument signals during inference. Building on music source separation (MSS), the authors propose three complementary techniques: (i) end-to-end fine-tuning (E2E-FT) for the Cascade approach, which chains MSS and similarity feature extraction and trains them jointly so that feature extraction tolerates separation errors; (ii) multi-task learning for the Direct approach, which couples disentangled similarity feature extraction with reconstruction-based MSS to strengthen instrument feature disentanglement; and (iii) perception-aware fine-tuning (PAFT), which uses human preference data to align the learned similarity with human perception. Experiments show that each technique improves instrument-level similarity performance, and that the Cascade approach with E2E-FT and PAFT outperforms the Direct approach with multi-task learning and PAFT.
📝 Abstract
This paper proposes music similarity representation learning (MSRL) based on individual instrument sounds (InMSRL), utilizing music source separation (MSS) and human preference data without requiring clean instrument sounds during inference. We propose three methods that effectively improve InMSRL performance. First, we introduce end-to-end fine-tuning (E2E-FT) for the Cascade approach, which sequentially performs MSS and music similarity feature extraction. E2E-FT allows the model to minimize the adverse effects of separation errors on feature extraction. Second, we propose multi-task learning for the Direct approach, which extracts disentangled music similarity features directly from the mixture using a single feature extractor. The multi-task objective combines disentangled music similarity feature extraction with MSS based on reconstruction from the disentangled features, further enhancing instrument feature disentanglement. Third, we employ perception-aware fine-tuning (PAFT), which leverages human preference data so that the model performs InMSRL aligned with human perceptual similarity. Experimental evaluations demonstrate that 1) E2E-FT for Cascade significantly improves InMSRL performance, 2) multi-task learning for Direct also improves disentanglement in the feature extraction, 3) PAFT significantly enhances perceptual InMSRL performance, and 4) Cascade with E2E-FT and PAFT outperforms Direct with multi-task learning and PAFT.
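To make the Cascade approach concrete, below is a minimal PyTorch sketch of MSS followed by per-instrument feature extraction, trained with a joint E2E-FT objective. The toy separator and extractor architectures, the L1 separation loss, and the weight `alpha` are illustrative assumptions, not the paper's actual models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeInMSRL(nn.Module):
    """Cascade: an MSS front end followed by per-instrument feature extractors."""
    def __init__(self, n_instruments: int = 4, emb_dim: int = 128):
        super().__init__()
        # Toy stand-in for a real separator (e.g., a U-Net/Conv-TasNet-style model).
        self.separator = nn.Conv1d(1, n_instruments, kernel_size=1)
        # One similarity feature extractor per instrument.
        self.extractors = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool1d(256), nn.Flatten(),
                          nn.Linear(256, emb_dim))
            for _ in range(n_instruments))

    def forward(self, mixture: torch.Tensor):
        # mixture: (batch, 1, time)
        stems = self.separator(mixture)  # (batch, n_inst, time)
        embs = torch.stack(
            [ext(stems[:, i:i + 1]) for i, ext in enumerate(self.extractors)],
            dim=1)                       # (batch, n_inst, emb_dim)
        return stems, embs

def e2e_ft_loss(model, mixture, ref_stems, sim_loss_fn, alpha: float = 0.5):
    # E2E-FT: gradients from the similarity loss flow back through the
    # separator, so the cascade learns to absorb separation errors.
    stems, embs = model(mixture)
    sep_loss = F.l1_loss(stems, ref_stems)
    return alpha * sep_loss + (1.0 - alpha) * sim_loss_fn(embs)
```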
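The Direct approach can be sketched in the same spirit: a single extractor maps the mixture straight to per-instrument embeddings, and an auxiliary decoder reconstructs each stem from its own embedding alone, which is the reconstruction-based MSS task that pressures the embeddings to stay disentangled. The shared linear decoder and frame-level reconstruction targets here are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectMultiTask(nn.Module):
    """Direct: one extractor yields disentangled per-instrument embeddings."""
    def __init__(self, n_instruments: int = 4, emb_dim: int = 128, frame: int = 256):
        super().__init__()
        self.n_instruments, self.emb_dim = n_instruments, emb_dim
        self.encoder = nn.Sequential(
            nn.AdaptiveAvgPool1d(frame), nn.Flatten(),
            nn.Linear(frame, n_instruments * emb_dim))
        # Auxiliary MSS head: reconstructs each stem from its embedding only.
        self.decoder = nn.Linear(emb_dim, frame)

    def forward(self, mixture: torch.Tensor):
        # mixture: (batch, 1, time)
        z = self.encoder(mixture).view(-1, self.n_instruments, self.emb_dim)
        recon = self.decoder(z)  # (batch, n_inst, frame)
        return z, recon

def multitask_loss(z, recon, ref_stems, sim_loss_fn, beta: float = 0.5):
    # Reconstruction from disentangled embeddings plus the similarity objective.
    targets = F.adaptive_avg_pool1d(ref_stems, recon.shape[-1])
    return beta * F.l1_loss(recon, targets) + (1.0 - beta) * sim_loss_fn(z)
```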
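PAFT is described above only as fine-tuning with human preference; one plausible realization, assumed here rather than taken from the paper, is a triplet-style margin loss over rater judgments: for an anchor excerpt, the excerpt raters judged more similar should embed closer to the anchor than the alternative.

```python
import torch
import torch.nn.functional as F

def paft_loss(emb_anchor, emb_preferred, emb_other, margin: float = 0.2):
    # Triplet-style preference objective (an assumed form of PAFT):
    # pull the human-preferred excerpt closer to the anchor than the
    # non-preferred one by at least `margin` in cosine distance.
    d_pos = 1.0 - F.cosine_similarity(emb_anchor, emb_preferred, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(emb_anchor, emb_other, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()

# Usage: fine-tune an already-trained Cascade or Direct extractor on
# human preference triplets (random embeddings stand in for real ones).
a, p, n = (torch.randn(8, 128) for _ in range(3))
loss = paft_loss(a, p, n)
```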