Adopting State-of-the-Art Pretrained Audio Representations for Music Recommender Systems

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This study systematically evaluates nine state-of-the-art pretrained audio models (including MusicFM, MERT, and Jukebox) on music recommendation tasks, addressing a gap at the intersection of music information retrieval (MIR) and recommender systems. The experiments combine five recommendation approaches (KNN, a shallow neural network, contrastive multimodal projection, a hybrid model, and BERT4Rec) in both hot and cold-start scenarios, and analyze how the usefulness of pretrained audio representations for recommendation differs from their usefulness in traditional MIR tasks. The findings indicate that the choice of audio representation substantially affects recommendation performance, and that a model's strength on MIR tasks does not necessarily carry over to recommendation, which gives empirical guidance for using audio semantics in recommender systems.


📝 Abstract

Over the years, the Music Information Retrieval (MIR) research community has released various models pretrained on large amounts of music data. Transfer learning has demonstrated the effectiveness of pretrained backend models for a broad spectrum of downstream tasks, including auto-tagging and genre classification. However, MIR papers generally do not explore the effectiveness of pretrained models for Music Recommender Systems (MRS). In addition, the Recommender Systems community tends to favour traditional end-to-end neural network training. Our research addresses this gap and evaluates the performance of nine pretrained backend models (MusicFM, Music2Vec, MERT, EncodecMAE, Jukebox, MusiCNN, MULE, MuQ and MuQ-MuLan) in the context of MRS. We assess them using five recommendation approaches: K-Nearest Neighbours (KNN), a Shallow Neural Network, Contrastive Multi-Modal projection, a Hybrid model, and BERT4Rec, in both hot and cold-start scenarios. Our findings suggest that pretrained audio representations exhibit a significant performance disparity between traditional MIR tasks and both hot and cold-start music recommendation, indicating that the aspects of musical information captured by backend models that matter most may differ depending on the task. This study establishes a foundation for further exploration of pretrained audio representations to enhance music recommendation systems.
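To make the simplest of the five approaches concrete, the following is a minimal sketch of KNN-style recommendation over frozen audio embeddings. It is not the authors' implementation: the function names (`cosine`, `recommend_knn`), the choice to average a listener's track embeddings into a profile, and the 3-dimensional toy vectors are all assumptions made for illustration. In practice the vectors would come from one of the pretrained backend models listed above. Because a track needs only an audio embedding to be ranked, the same idea applies to cold-start tracks with no interaction history.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend_knn(
    track_embeddings: Dict[str, Sequence[float]],
    listened: List[str],
    k: int = 2,
) -> List[str]:
    """Return the k unheard tracks closest to the mean embedding of listened tracks."""
    dim = len(next(iter(track_embeddings.values())))
    # Assumed user profile: the average embedding of the tracks already played.
    profile = [
        sum(track_embeddings[t][i] for t in listened) / len(listened)
        for i in range(dim)
    ]
    # Only tracks the user has not heard are candidates.
    candidates = [t for t in track_embeddings if t not in listened]
    candidates.sort(key=lambda t: cosine(profile, track_embeddings[t]), reverse=True)
    return candidates[:k]


# Hypothetical toy vectors standing in for real pretrained audio embeddings.
toy_embeddings = {
    "track_a": [1.0, 0.0, 0.0],
    "track_b": [0.9, 0.1, 0.0],
    "track_c": [0.0, 1.0, 0.0],
    "track_d": [0.8, 0.2, 0.1],
    "track_e": [0.0, 0.0, 1.0],
}

top = recommend_knn(toy_embeddings, listened=["track_a", "track_b"], k=2)
print(top)
```

Swapping the toy dictionary for embeddings from different backend models, while keeping the ranking logic fixed, is one way to compare how much recommendation-relevant information each representation carries.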

Problem

Research questions and friction points this paper is trying to address.

pretrained audio representations

music recommender systems

transfer learning

cold-start recommendation

music information retrieval
