MusiCRS: Benchmarking Audio-Centric Conversational Recommendation

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Music conversational recommendation faces two core challenges: weak reasoning over audio content and difficulty aligning conversational semantics with acoustic features across modalities. This paper introduces MusiCRS, the first audio-centric benchmark for conversational music recommendation, pairing real Reddit conversations with YouTube audio links across seven genres and supporting evaluation under audio-only, query-only, and multimodal (audio+query) input settings. By jointly considering raw audio signals and conversational semantics, the benchmark exposes fundamental limitations in current models’ ability to ground abstract musical concepts (e.g., “laid-back jazz improvisation”) in concrete acoustic features. The paper also provides an end-to-end evaluation framework with audio-link grounding, configurable multimodal inputs, and LLM-driven dialogue understanding and retrieval. Experiments show that state-of-the-art methods rely heavily on textual cues and fall short on deep audio reasoning. The dataset, evaluation code, and baseline models are released to advance research on joint audio–language reasoning.

📝 Abstract
Conversational recommendation has advanced rapidly with large language models (LLMs), yet music remains a uniquely challenging domain where effective recommendations require reasoning over audio content beyond what text or metadata can capture. We present MusiCRS, the first benchmark for audio-centric conversational recommendation that links authentic user conversations from Reddit with corresponding audio tracks. MusiCRS contains 477 high-quality conversations spanning diverse genres (classical, hip-hop, electronic, metal, pop, indie, jazz) with 3,589 unique musical entities and audio grounding via YouTube links. MusiCRS enables evaluation across three input modality configurations: audio-only, query-only, and audio+query (multimodal), allowing systematic comparison of audio-LLMs, retrieval models, and traditional approaches. Our experiments reveal that current systems rely heavily on textual signals and struggle with nuanced audio reasoning. This exposes fundamental limitations in cross-modal knowledge integration where models excel at dialogue semantics but cannot effectively ground abstract musical concepts in actual audio content. To facilitate progress, we release the MusiCRS dataset (https://huggingface.co/datasets/rohan2810/MusiCRS), evaluation code (https://github.com/rohan2810/musiCRS), and comprehensive baselines.
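Since the dataset is hosted on the Hugging Face Hub, a minimal sketch of loading and inspecting it might look as follows; the split name and record fields shown are assumptions, as the schema is not documented in this summary.

```python
# Minimal sketch: load MusiCRS from the Hugging Face Hub and inspect a record.
# Uses the standard `datasets` library; the split name ("train") and the field
# names are assumptions -- check the dataset card for the actual schema.
from datasets import load_dataset

ds = load_dataset("rohan2810/MusiCRS", split="train")  # split name assumed
print(ds)                      # number of conversations and column names
example = ds[0]
print(sorted(example.keys()))  # verify the actual fields before relying on them
```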
Problem

Research questions and friction points this paper is trying to address.

Music recommendation requires reasoning over audio content beyond text metadata
Lack of benchmarks linking user conversations with actual audio tracks for evaluation
Current systems struggle with nuanced audio reasoning and cross-modal knowledge integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Links authentic Reddit conversations with corresponding YouTube audio tracks across seven genres
Evaluates audio-only, query-only, and multimodal (audio+query) configurations (a hypothetical input-assembly sketch follows this list)
Reveals limitations in cross-modal knowledge integration: models handle dialogue semantics but cannot ground abstract musical concepts in audio
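As a rough illustration of these three configurations, the sketch below assembles per-conversation model inputs; the `conversation` and `audio` field names and the `build_inputs` helper are hypothetical, not part of the released evaluation code.

```python
# Hypothetical sketch of the three MusiCRS input configurations.
# Field names ("conversation", "audio") are assumptions about the record schema.
def build_inputs(example: dict, mode: str) -> dict:
    if mode == "query-only":      # dialogue text only
        return {"text": example["conversation"]}
    if mode == "audio-only":      # grounded audio only
        return {"audio": example["audio"]}
    if mode == "audio+query":     # full multimodal input
        return {"text": example["conversation"], "audio": example["audio"]}
    raise ValueError(f"unknown configuration: {mode}")
```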