AI Summary
This work addresses the performance degradation of large language model (LLM)-driven user simulators in real-world settings due to the "realism gap." To bridge this gap, the authors introduce the ConvApparel dataset, which uniquely captures human-agent dialogues through a dual-agent protocol that jointly deploys both "good" and "bad" recommender systems, complemented by first-person user-satisfaction annotations. They further propose a multidimensional evaluation framework encompassing statistical alignment, humanness scoring, and counterfactual validation to systematically assess simulator fidelity and generalization. Experimental results reveal that existing simulators suffer from significant realism gaps, whereas data-driven approaches demonstrate superior adaptability and more authentic user modeling, particularly in counterfactual scenarios.
Abstract
The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap": systems optimized against simulated interactions may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- deploying both "good" and "bad" recommenders -- captures a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction, and thereby enables counterfactual validation. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation, where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.
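As a rough illustration of two of the framework's checks (this is not the paper's implementation; all function names, metrics, and numbers below are invented for the sketch), statistical alignment can be approximated by a distributional distance between simulated and real satisfaction ratings, and counterfactual validation by checking that a simulator reports lower satisfaction with the "bad" recommender than with the "good" one:

```python
# Hypothetical sketch (assumed names/data, not from the paper):
# (1) statistical alignment: Kolmogorov-Smirnov-style distance between the
#     empirical CDFs of real vs. simulated satisfaction ratings;
# (2) counterfactual validation: a faithful simulator should be less
#     satisfied when paired with the "bad" recommender.

def ks_distance(a, b):
    """Max gap between the empirical CDFs of two samples of ratings."""
    support = sorted(set(a) | set(b))

    def cdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    return max(abs(cdf(a, x) - cdf(b, x)) for x in support)

def counterfactual_gap(sim_good, sim_bad):
    """Mean satisfaction drop when the simulator faces the bad recommender."""
    return sum(sim_good) / len(sim_good) - sum(sim_bad) / len(sim_bad)

# Toy 1-5 satisfaction ratings (illustrative only).
human_ratings = [5, 4, 4, 3, 5, 4]
simulator_ratings = [5, 5, 4, 4, 5, 5]
alignment = ks_distance(human_ratings, simulator_ratings)  # lower = closer

sim_with_good = [4, 5, 4, 5]
sim_with_bad = [2, 3, 2, 2]
gap = counterfactual_gap(sim_with_good, sim_with_bad)  # positive if the
# simulator reacts plausibly to the worse recommender
```

In this toy setup a simulator passes the counterfactual check when `gap` is clearly positive; the paper's actual framework additionally includes a human-likeness score not sketched here.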