🤖 AI Summary
Conversational music recommendation faces core challenges, including weak reasoning over audio content and difficulty aligning semantics across modalities. This paper introduces MusiCRS, the first benchmark for audio-centric conversational recommendation, which pairs real Reddit conversations with YouTube audio links across seven music genres and supports evaluation under audio-only, query-only, and audio+query (multimodal) input settings. It pioneers joint modeling of raw audio signals and conversational semantics, uncovering fundamental limitations in current models’ ability to map abstract musical concepts (e.g., “laid-back jazz improvisation”) to concrete acoustic features. The authors propose an end-to-end evaluation framework incorporating audio-link grounding, configurable multimodal inputs, and LLM-driven dialogue understanding and retrieval. Experiments reveal that state-of-the-art methods rely heavily on textual cues and show severe deficiencies in deep audio reasoning. The project open-sources the dataset, evaluation code, and baselines to advance research on joint audio–language reasoning.
📝 Abstract
Conversational recommendation has advanced rapidly with large language models (LLMs), yet music remains a uniquely challenging domain where effective recommendations require reasoning over audio content beyond what text or metadata can capture. We present MusiCRS, the first benchmark for audio-centric conversational recommendation that links authentic user conversations from Reddit with corresponding audio tracks. MusiCRS contains 477 high-quality conversations spanning diverse genres (classical, hip-hop, electronic, metal, pop, indie, jazz) with 3,589 unique musical entities and audio grounding via YouTube links. MusiCRS enables evaluation across three input modality configurations: audio-only, query-only, and audio+query (multimodal), allowing systematic comparison of audio-LLMs, retrieval models, and traditional approaches. Our experiments reveal that current systems rely heavily on textual signals and struggle with nuanced audio reasoning. This exposes fundamental limitations in cross-modal knowledge integration: models excel at dialogue semantics but cannot effectively ground abstract musical concepts in actual audio content. To facilitate progress, we release the MusiCRS dataset (https://huggingface.co/datasets/rohan2810/MusiCRS), evaluation code (https://github.com/rohan2810/musiCRS), and comprehensive baselines.
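As a quick orientation, the sketch below shows one way the released dataset might be loaded from the Hugging Face Hub and how inputs for the three modality configurations could be assembled. This is not the authors' official pipeline: the split handling and field names (`query`, `audio_url`) are illustrative assumptions, and the actual schema should be taken from the dataset card and the evaluation code linked above.

```python
# Minimal sketch (assumed schema, not the official MusiCRS pipeline):
# load the dataset and build inputs for the three evaluation settings.
from datasets import load_dataset

ds = load_dataset("rohan2810/MusiCRS")       # DatasetDict; split names are dataset-defined
first_split = list(ds.keys())[0]
example = ds[first_split][0]                 # first record of the first split

def build_input(record, mode="audio+query"):
    """Assemble model input for one of the three modality configurations."""
    query = record.get("query", "")          # assumed field: conversational request text
    audio = record.get("audio_url", None)    # assumed field: YouTube link grounding the track
    if mode == "query-only":
        return {"text": query, "audio": None}
    if mode == "audio-only":
        return {"text": "", "audio": audio}
    return {"text": query, "audio": audio}   # audio+query (multimodal)

for mode in ("audio-only", "query-only", "audio+query"):
    print(mode, build_input(example, mode))
```

Keeping the three configurations behind a single switch like this makes it easy to run the same retriever or audio-LLM under each setting, which is exactly the comparison the benchmark is designed to support.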