🤖 AI Summary
This paper addresses the inadequacy of traditional accuracy metrics in evaluating generative recommender systems (Gen-RecSys). We propose the first multidimensional evaluation framework targeting factual consistency, safety, fairness, and user-intent alignment. Methodologically, we systematically categorize Gen-RecSys risks into two classes: *hallucinatory generation* and *bias/privacy leakage*. Our framework adopts a scenario-driven, multi-metric evaluation paradigm, integrating prompt engineering, fact-checking models, bias auditing tools, dialogue safety protocols, and interpretability analysis. We further release the first open-source prototype benchmark for Gen-RecSys evaluation. Experimental results demonstrate that our framework effectively detects item hallucinations, quantifies recommendation bias, and verifies policy compliance, thereby significantly enhancing evaluation comprehensiveness, trustworthiness, and deployment accountability.
📝 Abstract
Recommender systems powered by generative models (Gen-RecSys) extend beyond classical item ranking by producing open-ended content, which simultaneously unlocks richer user experiences and introduces new risks. On one hand, these systems can enhance personalization and appeal through dynamic explanations and multi-turn dialogues. On the other hand, they might venture into unknown territory: hallucinating nonexistent items, amplifying bias, or leaking private information. Traditional accuracy metrics cannot fully capture these challenges, as they fail to measure factual correctness, content safety, or alignment with user intent. This paper makes two main contributions. First, we categorize the evaluation challenges of Gen-RecSys into two groups: (i) existing concerns that are exacerbated by generative outputs (e.g., bias, privacy) and (ii) entirely new risks (e.g., item hallucinations, contradictory explanations). Second, we propose a holistic evaluation approach that includes scenario-based assessments and multi-metric checks incorporating relevance, factual grounding, bias detection, and policy compliance. Our goal is to provide a guiding framework so researchers and practitioners can thoroughly assess Gen-RecSys, ensuring effective personalization and responsible deployment.
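
To make the idea of a multi-metric check concrete, the following is a minimal sketch and not the paper's implementation. It scores a single generated recommendation on three of the dimensions named above: relevance, factual grounding (measured here as the share of recommended items missing from the catalog), and policy compliance (approximated by a naive blocked-term match). The names `EvalReport` and `evaluate_response`, the sample data, and the nonexistent title "The Glass Archive" are all hypothetical and chosen only for illustration. Bias detection is omitted because it requires aggregate statistics over many users rather than a single response.

```python
from dataclasses import dataclass


@dataclass
class EvalReport:
    relevance: float           # share of recommended items that are in the catalog AND relevant
    hallucination_rate: float  # share of recommended items that are not in the catalog
    policy_violations: int     # number of blocked terms found in the generated explanation


def evaluate_response(recommended, explanation, catalog, relevant, blocked_terms):
    """Score one generated recommendation (illustrative assumptions throughout)."""
    # Policy check: naive case-insensitive substring match on the free-text explanation.
    text = explanation.lower()
    violations = sum(1 for term in blocked_terms if term.lower() in text)

    n = len(recommended)
    if n == 0:
        # Nothing was recommended, so the item-level ratios are defined as zero.
        return EvalReport(0.0, 0.0, violations)

    # Factual grounding: an item counts as grounded only if it exists in the catalog.
    grounded = [item for item in recommended if item in catalog]
    # Relevance: only grounded items can count as hits.
    hits = sum(1 for item in grounded if item in relevant)

    return EvalReport(
        relevance=hits / n,
        hallucination_rate=(n - len(grounded)) / n,
        policy_violations=violations,
    )


report = evaluate_response(
    recommended=["Dune", "Neuromancer", "The Glass Archive"],  # last title is made up
    explanation="Since you enjoyed Dune, you may like these sci-fi classics.",
    catalog={"Dune", "Neuromancer", "Foundation"},
    relevant={"Dune", "Foundation"},
    blocked_terms=["home address"],
)
print(report)
```

In this toy run, one of the three recommended titles is absent from the catalog and one is both grounded and relevant, so a ranking-only accuracy metric would miss the hallucinated item that the grounding check exposes. A realistic evaluator would likely replace the exact-match catalog lookup and the substring policy check with entity resolution and a dedicated safety classifier.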