A Standardized Re-evaluation of Conversational Recommender Systems on the ReDial Dataset

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty of fairly comparing conversational recommender systems, which stems from inconsistent data preprocessing, ground-truth definitions, and large language model (LLM) usage. Under a unified evaluation framework, we re-evaluate seven representative methods across three architectural families through standardized preprocessing, controlled experiments, and reproducibility analysis. Our findings reveal that fine-grained ranking performance (Recall@1) is highly sensitive to implementation details, and that nearly half of the reported accuracy is attributable to "repetition shortcuts" that are absent in novelty-focused evaluation. We apply user-centric utility metrics and show that observed performance gains are often driven more by the capacity of the LLM backbone than by architectural innovations. Conventional recall-based metrics also frequently overstate actual conversational effectiveness. The result is a transparent, controlled evaluation baseline.

📝 Abstract

Recent years have seen a surge of research into conversational recommender systems (CRS). Among existing datasets, ReDial is the most widely used benchmark, cited in hundreds of studies. However, variations in how the dataset is preprocessed and used in experiments, particularly in the definition of ground-truth items, make it difficult to compare results across studies. These comparisons are further complicated by confounding factors such as the choice of the underlying large language model (LLM) and the use of external data sources. In this work, we revisit seven prominent CRS methods across three architectural families and evaluate them under standardized conditions. Our reproducibility study reveals a "granularity gap," where fine-grained ranking (Recall@1) is highly sensitive to implementation details, while our replicability analysis shows that nearly 50% of reported accuracy stems from "repetition shortcuts" that are absent in novelty-focused evaluation. Furthermore, we find that performance gains are often driven more by the capacity of the LLM backbone than by specific architectural innovations. Finally, by applying user-centric utility metrics, we demonstrate that traditional recall frequently overstates a system's actual conversational effectiveness. This work establishes a transparent, controlled baseline and promotes evaluation practices that prioritize novelty and interaction efficiency.
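
To make the "repetition shortcut" idea concrete, the following is a minimal sketch of how Recall@k can differ between a standard setting and a novelty-focused one. It is not the paper's evaluation code: the function names, the tuple layout of each turn, and the item IDs (`m1`, `m2`, ...) are hypothetical, and the novelty rule used here (drop any ground-truth item already mentioned earlier in the dialogue) is an assumption about what novelty-focused evaluation might look like.

```python
from typing import List, Sequence, Set, Tuple

# One evaluated turn: (ranked recommendations, ground-truth items,
# items already mentioned earlier in the dialogue). Layout is hypothetical.
Turn = Tuple[Sequence[str], Set[str], Set[str]]


def recall_at_k(ranked: Sequence[str], targets: Set[str], k: int) -> float:
    """Fraction of ground-truth items found in the top-k recommendations."""
    if not targets:
        return 0.0
    return len(set(ranked[:k]) & targets) / len(targets)


def mean_recall(turns: List[Turn], k: int, novelty_only: bool = False) -> float:
    """Average Recall@k over turns.

    With novelty_only=True, ground-truth items that were already mentioned
    in the dialogue are discarded, and turns left without any ground-truth
    item are skipped (an assumed novelty rule, for illustration only).
    """
    scores = []
    for ranked, targets, mentioned in turns:
        if novelty_only:
            targets = targets - mentioned
            if not targets:
                continue
        scores.append(recall_at_k(ranked, targets, k))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data with made-up item IDs.
turns: List[Turn] = [
    (["m1", "m2", "m3"], {"m1"}, {"m1"}),  # target was already mentioned
    (["m2", "m4", "m5"], {"m4"}, set()),   # novel target, ranked second
    (["m6", "m7", "m8"], {"m6"}, set()),   # novel target, ranked first
]

standard = mean_recall(turns, k=1)
novel = mean_recall(turns, k=1, novelty_only=True)
print(f"Recall@1 standard: {standard:.3f}, novelty-only: {novel:.3f}")
```

In this toy example the first turn is a hit only because the system repeats an item the conversation already contained, so the standard score is higher than the novelty-only score. The size of that gap on ReDial depends on the exact ground-truth definition, which is the kind of implementation detail the abstract identifies as a source of inconsistency.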

Problem

Research questions and friction points this paper is trying to address.

Conversational Recommender Systems

ReDial Dataset

Evaluation Standardization

Reproducibility

Recall Metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conversational Recommender Systems

Standardized Evaluation

Reproducibility