🤖 AI Summary
This work addresses the limited behavioral diversity and controllability of existing seeker simulators used to evaluate emotional support chatbots, which prevents them from faithfully replicating real-world user behavior. To overcome this, we propose a controllable seeker simulator that is driven by nine psychological and linguistic features and trained with a Mixture-of-Experts (MoE) architecture on authentic Reddit conversations. Our approach enables fine-grained modeling of, and control over, diverse help-seeking behaviors. Experimental results show that the proposed simulator outperforms existing approaches in both behavioral diversity and adherence to seeker profiles. Furthermore, when used to evaluate seven prominent emotional support models, it uncovers performance degradations that were previously obscured, making model evaluation more realistic and better stress-tested.
📝 Abstract
As emotional support chatbots have recently gained significant traction across both research and industry, a common evaluation strategy has emerged: use help-seeker simulators to interact with supporter chatbots. However, current simulators suffer from two critical limitations: (1) they fail to capture the behavioral diversity of real-world seekers, often portraying them as overly cooperative, and (2) they lack the controllability required to simulate specific seeker profiles. To address these challenges, we present a controllable seeker simulator driven by nine psychological and linguistic features that underpin seeker behavior. Using authentic Reddit conversations, we train our model with a Mixture-of-Experts (MoE) architecture, which separates diverse seeker behaviors into specialized parameter subspaces, thereby enhancing fine-grained controllability. Our simulator achieves superior profile adherence and behavioral diversity compared to existing approaches. Furthermore, evaluating seven prominent supporter models with our system uncovers previously obscured performance degradations. These findings underscore the utility of our framework in providing a more faithful and stress-tested evaluation for emotional support chatbots.
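To make the idea of routing different seeker behaviors to "specialized parameter subspaces" concrete, here is a minimal, generic Mixture-of-Experts sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the router design, the number of experts, or how the nine features are encoded. The names `ProfileConditionedMoE`, `NUM_EXPERTS`, `TOP_K`, and `HIDDEN`, and the choice to condition the gate directly on a nine-dimensional profile vector, are all hypothetical.

```python
import math
import random

NUM_FEATURES = 9  # nine seeker features, per the abstract (their encoding is assumed)
NUM_EXPERTS = 4   # hypothetical: the abstract does not state the expert count
TOP_K = 2         # hypothetical: number of experts activated per input
HIDDEN = 8        # hypothetical hidden size


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ProfileConditionedMoE:
    """Toy MoE layer whose gate is conditioned on a seeker profile vector (assumed design)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Gate: maps the profile vector to one logit per expert.
        self.gate = random_matrix(NUM_EXPERTS, NUM_FEATURES, rng)
        # Each expert is a separate linear map, i.e. its own parameter subspace.
        self.experts = [random_matrix(HIDDEN, HIDDEN, rng) for _ in range(NUM_EXPERTS)]

    def route(self, profile):
        """Return (expert_index, weight) pairs for the top-k experts."""
        logits = matvec(self.gate, profile)
        top = sorted(range(NUM_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]
        weights = softmax([logits[i] for i in top])  # renormalize over the selected experts
        return list(zip(top, weights))

    def forward(self, hidden, profile):
        """Combine the selected experts' outputs, weighted by the gate."""
        out = [0.0] * HIDDEN
        for idx, weight in self.route(profile):
            expert_out = matvec(self.experts[idx], hidden)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


moe = ProfileConditionedMoE(seed=0)
profile = [0.0] * NUM_FEATURES
profile[0] = 1.0  # hypothetical profile: only the first feature is active
routing = moe.route(profile)
output = moe.forward([1.0] * HIDDEN, profile)
```

In this sketch, changing the profile vector changes which experts are selected and how their outputs are mixed, which is one plausible way a profile could steer generation. A trained system would learn the gate and expert weights from data rather than sampling them at random.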