AI Summary
Existing user simulation methods struggle to capture the authentic decision-making behaviors of customers with distinct personalities in multi-turn, multimodal retail scenarios. To address this limitation, this work proposes SalesSim, a framework that models users as intelligent agents endowed with personas and tool-use capabilities, and introduces an evaluation paradigm centered on “decision alignment.” Furthermore, the authors design UserGRPO, a multi-turn, multi-objective reinforcement learning recipe that jointly optimizes conversational fluency and decision alignment under persona specifications. Experimental results show that even the strongest of the six benchmarked multimodal large language models achieves less than 79% average alignment with its persona specifications; with UserGRPO, the baseline model's decision alignment improves by 13.8%, alongside improved conversational quality.
Abstract
We present SalesSim, a framework and testbed for evaluating the ability of Multimodal Large Language Models (MLLMs) to simulate realistic, persona-driven customer behavior in multi-turn, multimodal, tool-augmented online retail conversations. Unlike prior work that treats user simulation as surface-level dialogue generation, SalesSim models retail interaction and decision-making as a grounded, agentic process in which shoppers with diverse backgrounds, preferences, and dealbreakers interact with a sales agent, seek clarifications, and make informed purchasing decisions. For evaluation, we design a suite of metrics centered on decision alignment, which measures the consistency between the simulator's actions and its persona specifications, together with metrics for conversational quality. Benchmarking six open- and closed-source state-of-the-art models reveals several behavioral gaps. First, while models produce fluent conversations, they show significantly lower lexical diversity than human conversations and overdisclose their criteria across personas. Second, models tend to be persuaded by sales agent suggestions and drift from their persona specifications. Even the strongest model achieves less than 79% average alignment with its underlying persona specifications. To make progress on these limitations, we propose UserGRPO, a multi-turn, multi-objective reinforcement learning recipe that optimizes both conversational fluency and decision alignment under persona specifications. Our experiments demonstrate that UserGRPO boosts the decision alignment of the baseline model by 13.8% while improving conversational quality. By introducing SalesSim, we provide a new testbed for the community to investigate and improve the persona adherence of user simulators in goal-oriented settings.
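To make the idea of decision alignment and a multi-objective reward more concrete, the sketch below shows one way such quantities could be computed. It is an illustration under stated assumptions, not the SalesSim metric suite or the UserGRPO implementation: the `Persona` and `Decision` fields, the three checks, the equal-weight default in `combined_reward`, and all function names are hypothetical. The only part taken from the standard GRPO formulation is the group-relative advantage, which normalizes each reward by the mean and standard deviation of its sampled group.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Persona:
    """Hypothetical persona spec: a budget cap plus required and forbidden features."""
    max_budget: float
    must_have: frozenset
    dealbreakers: frozenset


@dataclass(frozen=True)
class Decision:
    """Hypothetical final purchase made by the simulated shopper."""
    price: float
    features: frozenset


def decision_alignment(persona: Persona, decision: Decision) -> float:
    """Fraction of persona checks the purchase satisfies (assumed scoring rule)."""
    checks = [
        decision.price <= persona.max_budget,            # stays within budget
        persona.must_have <= decision.features,          # all required features present
        not (persona.dealbreakers & decision.features),  # no dealbreaker present
    ]
    return sum(checks) / len(checks)


def combined_reward(alignment: float, fluency: float, w_align: float = 0.5) -> float:
    """Weighted sum of two objectives; the weighting scheme is an assumption."""
    return w_align * alignment + (1.0 - w_align) * fluency


def group_relative_advantages(rewards: list) -> list:
    """GRPO-style advantages: each reward standardized within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same, so none is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


if __name__ == "__main__":
    persona = Persona(100.0, frozenset({"wireless"}), frozenset({"refurbished"}))
    over_budget = Decision(120.0, frozenset({"wireless"}))
    score = decision_alignment(persona, over_budget)
    print(score, combined_reward(score, fluency=0.9))
```

In this toy setup, a shopper persuaded into an over-budget purchase satisfies two of the three checks, which mirrors the persona drift described above; how the paper actually scores such cases is not specified in the abstract.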