🤖 AI Summary
Simulation-based evaluation resources for conversational recommender systems (CRSs) remain scarce, and the existing UserSimCRS toolkit has fallen behind recent research. This work presents UserSimCRS v2, a major upgrade that brings the toolkit in line with the state of the art through four main extensions: (1) an enhanced agenda-based user simulator; (2) newly introduced large language model (LLM)-based user simulators; (3) integration with a wider range of CRSs and datasets; and (4) LLM-as-a-judge utilities for automated evaluation of simulated dialogues. The extensions are demonstrated in a case study, positioning the toolkit as standardized, reproducible infrastructure for simulation-based CRS evaluation.
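To make the LLM-as-a-judge idea concrete, here is a minimal Python sketch of how such a utility might score a simulated dialogue along several dimensions. The function, prompt, and dimension names are illustrative assumptions, not UserSimCRS's actual API; any LLM backend can be plugged in via the `llm_call` argument.

```python
import json
from typing import Callable, Dict, List

# Illustrative evaluation dimensions (assumed for this sketch, not taken from the toolkit).
DIMENSIONS = ["recommendation_relevance", "dialogue_coherence", "task_completion"]

JUDGE_PROMPT = """You are evaluating a conversation between a user and a
conversational recommender system. Rate the dialogue on a 1-5 scale for each
dimension and answer with a JSON object only, e.g. {{"dialogue_coherence": 4}}.

Dimensions: {dimensions}

Dialogue:
{dialogue}
"""


def judge_dialogue(
    dialogue: List[Dict[str, str]],
    llm_call: Callable[[str], str],
) -> Dict[str, int]:
    """Score a simulated dialogue with an LLM judge.

    dialogue: list of {"speaker": ..., "utterance": ...} turns.
    llm_call: any function mapping a prompt string to the LLM's raw reply.
    """
    transcript = "\n".join(f"{t['speaker']}: {t['utterance']}" for t in dialogue)
    prompt = JUDGE_PROMPT.format(dimensions=", ".join(DIMENSIONS), dialogue=transcript)
    reply = llm_call(prompt)
    scores = json.loads(reply)  # The judge is instructed to reply with JSON only.
    # Keep only the requested dimensions, defaulting missing ones to 0.
    return {dim: int(scores.get(dim, 0)) for dim in DIMENSIONS}


if __name__ == "__main__":
    # Stub LLM for demonstration; replace with a real model call.
    def fake_llm(prompt: str) -> str:
        return json.dumps({d: 4 for d in DIMENSIONS})

    demo_dialogue = [
        {"speaker": "USER", "utterance": "I'm looking for a sci-fi movie."},
        {"speaker": "SYSTEM", "utterance": "How about Arrival (2016)?"},
        {"speaker": "USER", "utterance": "Great, thanks!"},
    ]
    print(judge_dialogue(demo_dialogue, fake_llm))
```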
📝 Abstract
Resources for simulation-based evaluation of conversational recommender systems (CRSs) are scarce. The UserSimCRS toolkit was introduced to address this gap. In this work, we present UserSimCRS v2, a significant upgrade that aligns the toolkit with state-of-the-art research. Key extensions include an enhanced agenda-based user simulator, the introduction of large language model (LLM)-based simulators, integration with a wider range of CRSs and datasets, and new LLM-as-a-judge evaluation utilities. We demonstrate these extensions in a case study.
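As a rough illustration of what an LLM-based user simulator involves, the sketch below generates the simulated user's next utterance from a persona and the dialogue history. The class, prompt, and method names are hypothetical and assume only a pluggable LLM backend; they do not reflect the toolkit's real interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, List

SIMULATOR_PROMPT = """You are simulating a user talking to a movie recommender system.
Persona: {persona}
Conversation so far:
{history}
Reply with the user's next utterance only."""


@dataclass
class LLMUserSimulator:
    """Hypothetical LLM-based user simulator (not the UserSimCRS API)."""

    persona: str
    llm_call: Callable[[str], str]
    history: List[str] = field(default_factory=list)

    def receive(self, system_utterance: str) -> None:
        """Record the CRS's utterance in the dialogue history."""
        self.history.append(f"SYSTEM: {system_utterance}")

    def respond(self) -> str:
        """Generate the simulated user's next utterance with the LLM."""
        prompt = SIMULATOR_PROMPT.format(
            persona=self.persona, history="\n".join(self.history) or "(empty)"
        )
        utterance = self.llm_call(prompt).strip()
        self.history.append(f"USER: {utterance}")
        return utterance


if __name__ == "__main__":
    # Stub backend for demonstration; swap in a real LLM call.
    simulator = LLMUserSimulator(
        persona="Enjoys slow-paced sci-fi; dislikes horror.",
        llm_call=lambda prompt: "Do you have anything similar to Arrival?",
    )
    simulator.receive("Hi! What kind of movie are you in the mood for?")
    print(simulator.respond())
```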