ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

📅 2026-02-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the performance degradation of large language model (LLM)-driven user simulators in real-world settings, attributed to a "realism gap." To bridge this gap, the authors introduce the ConvApparel dataset, which captures human-agent dialogues through a dual-agent protocol that deliberately employs both "good" and "bad" recommender systems, complemented by first-person user satisfaction annotations. They further propose a multidimensional evaluation framework encompassing statistical alignment, humanness scoring, and counterfactual validation to systematically assess simulator fidelity and generalization. Experiments reveal that existing simulators exhibit significant realism gaps, whereas data-driven approaches demonstrate superior adaptability and more authentic user modeling, particularly in counterfactual scenarios.

๐Ÿ“ Abstract
The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation, where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.
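To make the validation framework concrete, the sketch below illustrates two of its three checks. The function names, the choice of dialogue turn counts as the aligned feature, and the numeric satisfaction scale are illustrative assumptions, not the paper's actual implementation: statistical alignment is approximated as a Jensen-Shannon divergence between real and simulated feature distributions, and counterfactual validation as the drop in a simulator's reported satisfaction when paired with the "bad" recommender.

```python
import math
from collections import Counter

def js_divergence(p_samples, q_samples):
    """Jensen-Shannon divergence (base 2, in [0, 1]) between two
    empirical distributions over discrete values, e.g. dialogue turn
    counts from real vs. simulated conversations. Lower = better
    statistical alignment."""
    support = set(p_samples) | set(q_samples)
    p_counts, q_counts = Counter(p_samples), Counter(q_samples)
    p = {v: p_counts[v] / len(p_samples) for v in support}
    q = {v: q_counts[v] / len(q_samples) for v in support}
    m = {v: 0.5 * (p[v] + q[v]) for v in support}

    def kl(a, b):
        # KL(a || b); terms with a[v] == 0 contribute nothing.
        return sum(a[v] * math.log2(a[v] / b[v]) for v in support if a[v] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def counterfactual_gap(sat_good, sat_bad):
    """Counterfactual check: a simulator that models users realistically
    should report lower mean satisfaction when interacting with the
    deliberately 'bad' recommender. Returns mean(good) - mean(bad);
    a clearly positive gap indicates realistic adaptation."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(sat_good) - mean(sat_bad)
```

Usage: compare turn-count distributions with `js_divergence(real_turns, simulated_turns)`, and feed per-dialogue satisfaction ratings from the good- and bad-recommender conditions into `counterfactual_gap`; a simulator that scores low divergence yet shows no satisfaction gap is matching surface statistics without a genuine user model.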
Problem

Research questions and friction points this paper is trying to address.

realism gap
user simulators
conversational recommenders
human-AI conversations
counterfactual validation
Innovation

Methods, ideas, or system contributions that make the work stand out.

user simulator
conversational recommender
counterfactual validation
realism gap
dual-agent data collection