TALKPLAY: Multimodal Music Recommendation with Large Language Models

📅 2025-02-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the insufficient semantic understanding in natural language query-driven music recommendation by reformulating recommendation as a multimodal token generation task for large language models (LLMs). Methodologically, it unifies heterogeneous information (audio, lyrics, metadata, semantic tags, and playlist co-occurrence) into a shared, extensible vocabulary of learnable tokens. It then constructs conversational sequences in which "query → recommendation" serves as the next-token prediction objective, enabling end-to-end, query-aware joint modeling. Its key contribution lies in being the first to fully embed recommendation into the natural language understanding paradigm, which eliminates the conventional separation between recommendation and dialogue systems, and in achieving semantic alignment and joint optimization via cross-modal token embeddings. Experiments demonstrate significant improvements over state-of-the-art baselines on Recall@10, NDCG@10, and query-relevance metrics.

📝 Abstract

We present TalkPlay, a multimodal music recommendation system that reformulates the recommendation task as large language model token generation. TalkPlay represents music through an expanded token vocabulary that encodes multiple modalities: audio, lyrics, metadata, semantic tags, and playlist co-occurrence. Using these rich representations, the model learns to generate recommendations through next-token prediction on music recommendation conversations, which requires learning the associations between natural language queries and responses, as well as music items. In other words, the formulation transforms music recommendation into a natural language understanding task, where the model's ability to predict conversation tokens directly optimizes query-item relevance. Our approach eliminates the complexity of traditional recommendation-dialogue pipelines, enabling end-to-end learning of query-aware music recommendations. In experiments, TalkPlay is successfully trained and outperforms baseline methods in various aspects, demonstrating strong context understanding as a conversational music recommender.
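
To make the formulation in the abstract concrete, the following is a minimal sketch of how an expanded token vocabulary and next-token training pairs could be built. It is not the authors' implementation. Everything specific here is an assumption for illustration: the idea that each track is reduced to one discrete cluster ID per modality, the token naming scheme (`<audio_2>`, `<tags_3>`), the `<user>` and `<assistant>` markers, the whitespace tokenizer, and all function names.

```python
from dataclasses import dataclass

# The five modalities named in the abstract.
MODALITIES = ("audio", "lyrics", "metadata", "tags", "playlist")


@dataclass(frozen=True)
class Track:
    title: str
    # ASSUMPTION: each modality is summarized by one discrete cluster ID,
    # e.g. obtained by quantizing a per-modality embedding.
    cluster_ids: dict


def track_tokens(track):
    """Render a track as one hypothetical token per modality."""
    return [f"<{m}_{track.cluster_ids[m]}>" for m in MODALITIES]


def build_vocab(text_tokens, clusters_per_modality):
    """Extend a text vocabulary with one token per (modality, cluster)."""
    vocab = {}
    for tok in text_tokens:
        vocab.setdefault(tok, len(vocab))
    for m in MODALITIES:
        for c in range(clusters_per_modality):
            vocab.setdefault(f"<{m}_{c}>", len(vocab))
    return vocab


def conversation_to_ids(turns, vocab):
    """Flatten (query, recommended track) turns into a single ID sequence."""
    ids = []
    for query, track in turns:
        tokens = ["<user>", *query.lower().split(), "<assistant>", *track_tokens(track)]
        ids.extend(vocab[t] for t in tokens)
    return ids


def next_token_pairs(ids):
    """Every prefix predicts the token that follows it."""
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]


# Toy usage with made-up data.
track = Track(
    "Example Song",
    {"audio": 2, "lyrics": 0, "metadata": 1, "tags": 3, "playlist": 2},
)
text_tokens = ["<user>", "<assistant>", "play", "something", "mellow"]
vocab = build_vocab(text_tokens, clusters_per_modality=4)
ids = conversation_to_ids([("play something mellow", track)], vocab)
pairs = next_token_pairs(ids)
```

Because the query words and the track tokens live in one sequence, the pairs whose target is a track token are exactly the "query → recommendation" predictions: a language model trained on such pairs with an ordinary next-token loss is, in this sketch, learning query-item relevance without a separate retrieval stage.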

Problem

Research questions and friction points this paper is trying to address.

Multimodal music recommendation system

Large language model token generation

End-to-end learning of query-aware recommendations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal token generation

End-to-end recommendation learning

Conversational relevance optimization

🔎 Similar Papers

MMREC: LLM Based Multi-Modal Recommender System