AI Summary
This work addresses the performance degradation of large language model (LLM)-driven user simulators in real-world settings due to the "realism gap." To bridge this gap, the authors introduce the ConvApparel dataset, which uniquely captures human-agent dialogues through a dual-agent protocol that jointly deploys both "good" and "bad" recommender systems, complemented by first-person user-satisfaction annotations. They further propose a multidimensional evaluation framework encompassing statistical alignment, humanness scoring, and counterfactual validation to systematically assess simulator fidelity and generalization. Experimental results reveal that existing simulators suffer from significant realism gaps, whereas data-driven approaches demonstrate superior adaptability and more authentic user modeling, particularly in counterfactual scenarios.
Abstract
The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap": systems optimized against simulated interactions may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- deploying both "good" and "bad" recommenders -- captures a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction, and thereby enables counterfactual validation. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation, where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.
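As a rough illustration of two of the framework's checks (this is not the paper's implementation; all function names, metrics, and numbers below are invented for the sketch), statistical alignment can be approximated by a distributional distance between simulated and real satisfaction ratings, and counterfactual validation by checking that a simulator reports lower satisfaction with the "bad" recommender than with the "good" one:

```python
# Hypothetical sketch (assumed names/data, not from the paper):
# (1) statistical alignment: Kolmogorov-Smirnov-style distance between the
#     empirical CDFs of real vs. simulated satisfaction ratings;
# (2) counterfactual validation: a faithful simulator should be less
#     satisfied when paired with the "bad" recommender.

def ks_distance(a, b):
    """Max gap between the empirical CDFs of two samples of ratings."""
    support = sorted(set(a) | set(b))

    def cdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    return max(abs(cdf(a, x) - cdf(b, x)) for x in support)

def counterfactual_gap(sim_good, sim_bad):
    """Mean satisfaction drop when the simulator faces the bad recommender."""
    return sum(sim_good) / len(sim_good) - sum(sim_bad) / len(sim_bad)

# Toy 1-5 satisfaction ratings (illustrative only).
human_ratings = [5, 4, 4, 3, 5, 4]
simulator_ratings = [5, 5, 4, 4, 5, 5]
alignment = ks_distance(human_ratings, simulator_ratings)  # lower = closer

sim_with_good = [4, 5, 4, 5]
sim_with_bad = [2, 3, 2, 2]
gap = counterfactual_gap(sim_with_good, sim_with_bad)  # positive if the
# simulator reacts plausibly to the worse recommender
```

In this toy setup a simulator passes the counterfactual check when `gap` is clearly positive; the paper's actual framework additionally includes a human-likeness score not sketched here.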