🤖 AI Summary
Traditional retrieval-and-ranking evaluation frameworks are inadequate for LLM-driven Music Recommender Systems (MRS): the generative nature of LLMs, together with their propensity to hallucinate, their non-determinism, and their knowledge limitations, undermines conventional accuracy metrics, while existing user studies and fairness analyses lack the depth needed for rigorous quality assessment. Method: We propose the first evaluation framework designed specifically for LLM-based MRS. It integrates prompt engineering and decomposes evaluation into complementary “success dimensions” (e.g., relevance, diversity, explainability) and “risk dimensions” (e.g., hallucination, bias, temporal validity). Contribution/Results: The resulting framework is multidimensional, actionable, and interdisciplinary, combining theoretical rigor with practical guidance. It establishes a methodological foundation and a community benchmark for developing trustworthy, transparent, and responsible LLM-based MRS, advancing both research and deployment standards in generative music recommendation.
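To make the success/risk decomposition concrete, here is a minimal Python sketch of what such an evaluation report could look like. All names, weights, and the aggregation choices are illustrative assumptions, not the paper's actual framework or API; the dimension labels simply follow the summary above.

```python
from dataclasses import dataclass, field

# Hypothetical dimension sets, taken from the summary above; any real
# framework would define scoring rubrics for each one.
SUCCESS_DIMENSIONS = ("relevance", "diversity", "explainability")
RISK_DIMENSIONS = ("hallucination", "bias", "temporal_validity")

@dataclass
class EvaluationReport:
    """Scores in [0, 1] for one recommendation slate, one entry per dimension."""
    success: dict[str, float] = field(default_factory=dict)
    risk: dict[str, float] = field(default_factory=dict)

    def summary(self) -> dict[str, float]:
        # Report success and risk separately rather than collapsing them
        # into a single accuracy-style number; keeping the dimensions
        # distinct is the point of the multidimensional framing.
        return {
            "success_mean": sum(self.success.values()) / max(len(self.success), 1),
            "risk_max": max(self.risk.values(), default=0.0),
        }

report = EvaluationReport(
    success={"relevance": 0.82, "diversity": 0.55, "explainability": 0.70},
    risk={"hallucination": 0.10, "bias": 0.25, "temporal_validity": 0.05},
)
print(report.summary())  # success_mean ≈ 0.69, risk_max = 0.25
```

Reporting `risk_max` rather than a mean is one defensible choice here: a slate with a single severe hallucination should not be rescued by low scores on the other risk dimensions.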
📝 Abstract
Music Recommender Systems (MRS) have long relied on an information-retrieval framing, where progress is measured mainly through accuracy on retrieval-oriented subtasks. While effective, this reductionist paradigm struggles to address the deeper question of what makes a good recommendation, and attempts to broaden evaluation, through user studies or fairness analyses, have had limited impact. The emergence of Large Language Models (LLMs) disrupts this framework: LLMs are generative rather than ranking-based, making standard accuracy metrics questionable. They also introduce challenges such as hallucinations, knowledge cutoffs, non-determinism, and opaque training data, rendering traditional train/test protocols difficult to interpret. At the same time, LLMs create new opportunities, enabling natural-language interaction and even allowing models to act as evaluators.
This work argues that the shift toward LLM-driven MRS requires rethinking evaluation. We first review how LLMs reshape user modeling, item modeling, and natural-language recommendation in music. We then examine evaluation practices from NLP, highlighting methodologies and open challenges relevant to MRS. Finally, we synthesize these insights, focusing on how LLM prompting applies to MRS, to outline a structured set of success and risk dimensions. Our goal is to provide the MRS community with an updated, pedagogical, and cross-disciplinary perspective on evaluation.
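As one illustration of the LLM-as-evaluator idea raised in the abstract, the sketch below asks a judge model to rate a recommendation's relevance and samples it several times, since a non-deterministic evaluator yields a distribution of verdicts rather than a single number. This is a hedged sketch: `complete` is a hypothetical stand-in for any chat-completion client, and the rubric and 1-to-5 scale are illustrative, not the paper's protocol.

```python
import re
from statistics import mean
from typing import Callable

JUDGE_PROMPT = """You are evaluating a music recommender system.
User context: {context}
Recommended track: {track}
Rate the relevance of this recommendation on a scale of 1 (irrelevant)
to 5 (highly relevant). Answer with a single integer."""

def judge_relevance(
    complete: Callable[[str], str],  # hypothetical stand-in for an LLM client call
    context: str,
    track: str,
    n_samples: int = 5,
) -> float:
    """Query the judge several times and average the parsed scores to
    smooth over sampling noise in the evaluator itself."""
    scores = []
    for _ in range(n_samples):
        reply = complete(JUDGE_PROMPT.format(context=context, track=track))
        match = re.search(r"[1-5]", reply)
        if match:  # skip unparseable replies instead of guessing a score
            scores.append(int(match.group()))
    if not scores:
        raise ValueError("judge returned no parseable scores")
    return mean(scores)

# Usage with any client wrapper, e.g.:
# score = judge_relevance(my_llm_call, "late-night lo-fi study playlist",
#                         "Nujabes - Aruarian Dance")
```

Repeated sampling does not remove the deeper problems the abstract names (hallucination, knowledge cutoffs, opaque training data); it only makes the evaluator's own variance visible and reportable.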