Bridging the Gap Between Semantic and User Preference Spaces for Multi-modal Music Representation Learning

📅 2025-05-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing music representation learning methods either neglect linguistic semantics or rely on scarce, costly audio–text annotations, while largely ignoring user preference modeling—limiting recommendation performance. To address this, we propose a hierarchical two-stage contrastive learning framework: (1) cross-modal contrastive pre-training on large-scale audio and text data; and (2) preference-aware contrastive fine-tuning on real-world user interaction logs. Our approach is the first to hierarchically integrate semantic understanding and user behavior modeling within a unified framework—without requiring manually annotated audio–text pairs or adhering to conventional collaborative filtering paradigms. Experiments demonstrate significant improvements over state-of-the-art methods on both music semantic retrieval and personalized recommendation tasks. The learned audio encoder achieves strong semantic alignment with textual descriptions while simultaneously encoding user-specific preferences, enabling dual-purpose representational utility.

📝 Abstract
Recent work on music representation learning mainly focuses on learning acoustic music representations from unlabeled audio, or further attempts to acquire multi-modal music representations from scarce annotated audio-text pairs. These methods either ignore language semantics or rely on labeled audio datasets that are difficult and expensive to create. Moreover, modeling the semantic space alone usually fails to achieve satisfactory performance on music recommendation tasks, since the user preference space is ignored. In this paper, we propose a novel Hierarchical Two-stage Contrastive Learning (HTCL) method that models similarity hierarchically, from the semantic perspective to the user perspective, to learn a comprehensive music representation that bridges the gap between semantic and user preference spaces. We devise a scalable audio encoder and leverage a pre-trained BERT model as the text encoder to learn audio-text semantics via large-scale contrastive pre-training. Further, we explore a simple yet effective way to exploit interaction data from our online music platform to adapt the semantic space to the user preference space via contrastive fine-tuning, which differs from previous works that follow the idea of collaborative filtering. As a result, we obtain a powerful audio encoder that not only distills language semantics from the text encoder but also models similarity in the user preference space while preserving the integrity of the semantic space. Experimental results on both music semantic and recommendation tasks confirm the effectiveness of our method.
Problem

Research questions and friction points this paper is trying to address.

Bridging semantic and user preference spaces in music representation
Learning multi-modal music representations without extensive labeled data
Improving music recommendation by modeling user preferences and semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Two-stage Contrastive Learning method
Scalable audio encoder with pre-trained BERT
Contrastive fine-tuning with user interaction data
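The paper does not include code here; as an illustration, the symmetric InfoNCE objective that contrastive stages like these typically optimize can be sketched in NumPy. In stage 1 the two inputs would be audio and text embeddings of the same track; in stage 2, embeddings of tracks co-preferred in user interaction logs. The function name and temperature value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    anchors, positives: (batch, dim) arrays; row i of `anchors` is the
    positive match of row i of `positives` (e.g. audio vs. text in
    pre-training, or co-preferred tracks in fine-tuning).
    Names and the 0.07 temperature are illustrative defaults.
    """
    # L2-normalize so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(a))          # matching pairs lie on the diagonal

    def xent(l):
        # numerically stable cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average both directions (anchor->positive and positive->anchor)
    return 0.5 * (xent(logits) + xent(logits.T))
```

Pulling matched pairs together and pushing apart all other in-batch pairs is what lets the fine-tuning stage reshape the semantic space toward user preferences without discarding it.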
Xiaofeng Pan
NetEase Inc., Hangzhou, China
Jing Chen
NetEase Inc., Hangzhou, China
Haitong Zhang
NetEase Inc., Hangzhou, China
Menglin Xing
NetEase Inc., Hangzhou, China
Jiayi Wei
Microsoft AI
Xuefeng Mu
NetEase Inc., Hangzhou, China
Zhongqian Xie
NetEase Inc., Hangzhou, China