🤖 AI Summary
This work addresses the underexplored Sim2Real gap between large language model (LLM)-based user simulators and actual human behavior in interactive tasks. We propose the User-Sim Index (USI), a novel metric for systematically evaluating simulation fidelity, and apply it in a large-scale controlled study of 31 LLM-based simulators and 451 real users across 165 tasks under the full τ-bench protocol. Through multidimensional behavioral analysis, analysis of rule-based reward signals, and comparisons across model families, we quantitatively reveal systematic biases in LLM simulators: they consistently exhibit excessive cooperativeness, overly positive feedback, and limited stylistic diversity, leading to inflated agent success rates. Notably, improvements in general model capability do not necessarily translate to higher simulation fidelity. Our findings establish USI as a new standard for assessing the realism of user simulation and underscore the critical need for human-in-the-loop validation.
📝 Abstract
As NLP evaluation shifts from static benchmarks to multi-turn interactive settings, LLM-based simulators have become widely used as user proxies, serving two roles: generating user turns and providing evaluation signals. Yet these simulators are often assumed to be faithful to real human behavior without rigorous verification. We formalize the Sim2Real gap in user simulation and present the first study to run the full τ-bench protocol with real humans (451 participants, 165 tasks), benchmarking 31 LLM simulators across proprietary, open-source, and specialized families using the User-Sim Index (USI), a metric we introduce to quantify how closely LLM simulators resemble real users' interactive behavior and feedback. Behaviorally, LLM simulators are excessively cooperative, stylistically uniform, and lack realistic frustration or ambiguity, creating an "easy mode" that inflates agent success rates above the human baseline. In evaluation, real humans provide nuanced judgments across eight quality dimensions, whereas simulated users produce uniformly more positive feedback, and rule-based rewards fail to capture the rich feedback signals that human users generate. Overall, higher general model capability does not necessarily yield more faithful user simulation. These findings highlight the importance of human validation when LLM-based user simulators are used in the agent development cycle and motivate improved models for user simulation.