🤖 AI Summary
This study addresses a limitation of current emotional support dialogue systems (ESDSes): they often underperform with distressed users who show low engagement, resistance, limited self-disclosure, emotional volatility, or rigid negative interpretations. Existing evaluations rely largely on cooperative, average-case simulated users and so fail to capture these harder interactions. To bridge this gap, the authors run an expert simulation study with eight experienced counselling professionals, derive worst-case seeker behaviours from it, and build a worst-case evaluation framework that pairs an LLM-based seeker simulator with four metrics: Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, and Authentic and Grounded Support. An evaluation of 17 systems shows that nearly all suffer substantial performance drops under worst-case interactions; large general-purpose LLMs are generally more robust than specialised ESDSes, but even the strongest struggle to sustain engagement or improve seekers' emotional states. The study also shows that worst-case simulation can generate training data that improves the robustness of smaller models.
📝 Abstract
Emotional Support Dialogue Systems (ESDSes) are increasingly evaluated and trained with LLM-simulated seekers. However, such simulated seekers often behave as cooperative, average-case users who disclose clearly, respond constructively, and accept support within a few turns. This can lead to overly optimistic evaluation and obscure whether ESDSes can handle difficult help-seeking interactions. In this work, we study ESDS evaluation under worst-case interactions, where seekers are hard to help due to low engagement, resistance, limited self-disclosure, emotional volatility, or rigid negative interpretations. We first conduct an expert simulation study with eight experienced counselling professionals, who simulate difficult seekers, interact with existing Chinese ESDSes, provide scale ratings, and participate in semi-structured interviews. Based on this study, we derive worst-case seeker behaviours and identify key limitations of current systems. We then propose a worst-case evaluation framework consisting of an LLM-based worst-case seeker simulator and four worst-case-oriented metrics: Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, and Authentic and Grounded Support. Evaluating 17 systems, we find that nearly all models suffer substantial performance drops under worst-case interactions. Large general-purpose LLMs are generally more robust than specialised ESDSes, but even the strongest models struggle to sustain engagement and improve seekers' emotional states. Finally, we show that worst-case simulation can also generate useful training data, improving the robustness of smaller models.
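As a rough illustration of how a "performance drop" under worst-case interactions could be quantified, the sketch below averages per-dialogue ratings on the four metrics named in the abstract and computes the relative drop between an average-case and a worst-case condition. Only the metric names come from the abstract; the function names, the score format, and the relative-drop formula are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean

# Metric names are taken from the abstract; everything else is hypothetical.
METRICS = (
    "Deep Emotional Understanding",
    "Guided Exploration",
    "Balanced Emotional Support",
    "Authentic and Grounded Support",
)


def mean_scores(dialogue_scores):
    """Average each metric over a list of per-dialogue score dicts."""
    return {m: mean(d[m] for d in dialogue_scores) for m in METRICS}


def relative_drop(average_case, worst_case):
    """Fractional drop per metric: (average - worst) / average.

    Assumes every average-case score is strictly positive.
    """
    return {
        m: (average_case[m] - worst_case[m]) / average_case[m]
        for m in METRICS
    }
```

Given lists of per-dialogue ratings for one system under each condition, `relative_drop(mean_scores(avg_dialogues), mean_scores(worst_dialogues))` would return one fraction per metric, where larger values indicate lower robustness. The rating scale and the aggregation actually used by the authors may differ.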