🤖 AI Summary
Traditional offline evaluation of long-audio recommendations (e.g., podcasts) suffers from exposure bias, while online A/B testing incurs high operational costs. To address these challenges, this paper proposes a two-stage LLM-based offline evaluation framework. Its core innovation is a profile-aware mechanism: from each user's 90-day listening history, it constructs a natural-language user profile spanning both topical interests and behavioral patterns; these semantically distilled profiles replace raw interaction sequences, substantially reducing LLM input complexity while improving relevance-judgment accuracy and interpretability. The framework employs an LLM as the evaluator, supporting both pointwise and pairwise assessment of recommendations. In a controlled study with 47 human annotators, the framework achieves strong agreement with manual evaluation (Cohen's κ = 0.82), outperforming or matching baseline methods that rely on raw behavioral sequences. This approach substantially speeds up offline iteration on recommendation systems.
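The agreement figure above is Cohen's κ, which corrects raw rater agreement for agreement expected by chance. A minimal, self-contained sketch of the computation (not code from the paper):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters
    who labeled the same items (e.g., LLM judge vs. human annotator)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: probability of a chance match under the two
    # raters' independent label marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A κ of 0.82 sits in the range conventionally read as "almost perfect" agreement, well above what matching label frequencies alone would produce.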
📄 Abstract
Evaluating personalized recommendations remains a central challenge, especially in long-form audio domains like podcasts, where traditional offline metrics suffer from exposure bias and online methods such as A/B testing are costly and operationally constrained. In this paper, we propose a novel framework that leverages Large Language Models (LLMs) as offline judges to assess the quality of podcast recommendations in a scalable and interpretable manner. Our two-stage profile-aware approach first constructs natural-language user profiles distilled from 90 days of listening history. These profiles summarize both topical interests and behavioral patterns, serving as compact, interpretable representations of user preferences. Rather than prompting the LLM with raw data, we use these profiles to provide high-level, semantically rich context, enabling the LLM to reason more effectively about the alignment between a user's interests and recommended episodes. This reduces input complexity and improves interpretability. The LLM is then prompted to deliver fine-grained pointwise and pairwise judgments based on the profile-episode match. In a controlled study with 47 participants, our profile-aware judge matched human judgments with high fidelity and outperformed or matched a variant using raw listening histories. The framework enables efficient, profile-aware evaluation for iterative testing and model selection in recommender systems.
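The two-stage flow described above can be sketched as prompt construction: stage one distills the raw history into a profile, and stage two judges candidates against that profile pointwise or pairwise. The paper does not publish its prompts; all wording and function names below are illustrative assumptions:

```python
# Hypothetical sketch of the two-stage profile-aware judge. The prompt text
# and the function names are assumptions, not the paper's actual prompts;
# the returned strings would be sent to an LLM of your choice.

def build_profile_prompt(listening_history: list) -> str:
    """Stage 1: ask the LLM to distill a 90-day history into a profile."""
    shows = "\n".join("- " + item for item in listening_history)
    return (
        "Summarize this user's podcast listening history into a short "
        "natural-language profile covering topical interests and "
        "behavioral patterns (e.g., episode length, completion habits):\n"
        + shows
    )

def pointwise_prompt(profile: str, episode: str) -> str:
    """Stage 2a: rate a single recommended episode against the profile."""
    return (
        "User profile:\n" + profile + "\n\n"
        "Candidate episode:\n" + episode + "\n\n"
        "Rate how well this episode matches the profile on a 1-5 scale "
        "and briefly justify the score."
    )

def pairwise_prompt(profile: str, episode_a: str, episode_b: str) -> str:
    """Stage 2b: compare two recommended episodes against the profile."""
    return (
        "User profile:\n" + profile + "\n\n"
        "Episode A:\n" + episode_a + "\n\n"
        "Episode B:\n" + episode_b + "\n\n"
        "Which episode better matches the profile? Answer A or B and "
        "briefly justify."
    )
```

Because the profile replaces the raw interaction sequence, the stage-two prompts stay short and stable regardless of how long the underlying 90-day history is.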