🤖 AI Summary
This study addresses the challenge of evaluating LLM-based emotional-support dialogue agents in a large-scale, theory-driven manner. We propose the first end-to-end automated evaluation framework grounded in Clara Hill's Exploration-Insight-Action (EIA) counseling theory: it orchestrates parallel dialogues between two LLMs, one simulating a help-seeker and the other a support provider, then employs a psychology-informed, prompt-engineered "LLM-as-judge" to perform pairwise discrimination and anchored scoring across the three EIA dimensions. Our contribution lies in deeply integrating classical counseling theory into AI evaluation, enabling fully automated, interpretable, and reproducible large-scale model comparison while overcoming the human-annotation bottleneck. Experiments show strong agreement with doctoral-level human annotators: 85% on Exploration, 83% on Insight, and 86% on Action, demonstrating human-level reliability at significantly reduced cost.
📝 Abstract
Large language models (LLMs) increasingly power mental-health chatbots, yet the field still lacks a scalable, theory-grounded way to decide which model is most effective to deploy. We present ESC-Judge, the first end-to-end evaluation framework that (i) grounds head-to-head comparisons of emotional-support LLMs in Clara Hill's established Exploration-Insight-Action counseling model, providing a structured and interpretable view of performance, and (ii) fully automates the evaluation pipeline at scale. ESC-Judge operates in three stages: first, it synthesizes realistic help-seeker roles by sampling empirically salient attributes such as stressors, personality, and life history; second, it has two candidate support agents conduct separate sessions with the same role, isolating model-specific strategies; and third, it asks a specialized judge LLM to express pairwise preferences across rubric-anchored skills that span the Exploration, Insight, and Action spectrum. In our study, ESC-Judge matched PhD-level annotators on 85 percent of Exploration, 83 percent of Insight, and 86 percent of Action decisions, demonstrating human-level reliability at a fraction of the cost. All code, prompts, synthetic roles, transcripts, and judgment scripts are released to promote transparent progress in emotionally supportive AI.
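The three-stage pipeline described above can be sketched in miniature. The following Python is an illustrative skeleton only, not the released ESC-Judge code: all names (`sample_role`, `run_session`, `judge_pair`) are hypothetical, the attribute pools are toy placeholders, and the LLM calls are stubbed out with deterministic stand-ins.

```python
import random
from dataclasses import dataclass

# The three rubric dimensions from Clara Hill's counseling model.
DIMENSIONS = ("Exploration", "Insight", "Action")

@dataclass
class SeekerRole:
    # Empirically salient attributes used to synthesize a help-seeker.
    stressor: str
    personality: str
    life_history: str

def sample_role(rng: random.Random) -> SeekerRole:
    """Stage 1: synthesize a realistic help-seeker role (toy pools)."""
    return SeekerRole(
        stressor=rng.choice(["job loss", "breakup", "exam anxiety"]),
        personality=rng.choice(["introverted", "outgoing"]),
        life_history=rng.choice(["recent relocation", "long-term caregiver"]),
    )

def run_session(model_name: str, role: SeekerRole) -> list[str]:
    """Stage 2: stand-in for a seeker-LLM / supporter-LLM dialogue loop."""
    return [f"{model_name} responds to a seeker facing {role.stressor}"]

def judge_pair(transcript_a: list[str], transcript_b: list[str]) -> dict[str, str]:
    """Stage 3: stand-in for the rubric-anchored judge LLM.

    Returns a pairwise preference ('A' or 'B') per EIA dimension;
    a real judge would prompt an LLM with both transcripts and the rubric.
    """
    return {dim: "A" for dim in DIMENSIONS}  # placeholder verdicts

def compare(model_a: str, model_b: str, n_roles: int = 3) -> dict[str, int]:
    """Count how often model_a is preferred on each dimension."""
    rng = random.Random(0)  # fixed seed so role sampling is reproducible
    wins = {dim: 0 for dim in DIMENSIONS}
    for _ in range(n_roles):
        role = sample_role(rng)          # same role given to both candidates
        ta = run_session(model_a, role)  # sessions run independently,
        tb = run_session(model_b, role)  # isolating model-specific strategy
        for dim, verdict in judge_pair(ta, tb).items():
            wins[dim] += verdict == "A"
    return wins
```

Holding the sampled role fixed across both candidate sessions is the key design choice: any difference the judge observes is then attributable to the support model rather than to the simulated seeker.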