Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models

📅 2025-05-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of evaluating large language models (LLMs) solely on surface-level text matching, targeting instead genuine human-like higher-order social cognition, specifically empathy and theory of mind. To this end, the paper proposes SAGE: a framework that constructs sentient agents integrating computationally modeled affective trajectories with interpretable, chain-of-thought representations of internal mental states. SAGE achieves, for the first time, automated empathy assessment that correlates strongly (r > 0.87) with the psychological gold standard, the Barrett-Lennard Relationship Inventory (BLRI). It incorporates psychometric alignment, supportive-dialogue scenario synthesis, and a standardized evaluation protocol. Validated across 100 diverse conversational scenarios, SAGE yields the Sentient Leaderboard, a benchmark covering 18 LLMs, which reveals that state-of-the-art models exhibit up to fourfold higher social-cognitive capability than earlier models. Crucially, SAGE demonstrates markedly better discriminability and ecological validity than conventional leaderboards such as Arena.

📝 Abstract
Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge. To bridge the gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM's higher-order social cognition. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4x) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
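The abstract's per-turn loop (the agent updates its emotion, records an inner thought, and replies) can be sketched as follows. This is a minimal illustration of the idea only, not the paper's implementation: the function names, the 0-100 emotion scale, the neutral starting value of 50, and the termination thresholds are all assumptions introduced here.

```python
# Hypothetical sketch of a SAGE-style multi-turn evaluation loop.
# All names and the emotion scale are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TurnRecord:
    emotion: int        # agent's emotion score after this turn (assumed 0-100)
    inner_thought: str  # interpretable reasoning about its mental state
    reply: str          # what the sentient agent says back

@dataclass
class SentientEvaluation:
    trajectory: List[TurnRecord] = field(default_factory=list)

    @property
    def final_emotion(self) -> int:
        # The final emotion score is the episode-level evaluation signal.
        return self.trajectory[-1].emotion if self.trajectory else 50

def run_episode(target_llm: Callable[[str], str],
                sentient_agent: Callable[[str, int], TurnRecord],
                opening: str, max_turns: int = 5) -> SentientEvaluation:
    """Multi-turn supportive dialogue: the tested LLM responds, and the
    sentient agent updates its emotion and inner thoughts each turn."""
    evaluation = SentientEvaluation()
    agent_msg, emotion = opening, 50  # assumed neutral start
    for _ in range(max_turns):
        llm_reply = target_llm(agent_msg)
        record = sentient_agent(llm_reply, emotion)
        evaluation.trajectory.append(record)
        agent_msg, emotion = record.reply, record.emotion
        if emotion >= 90 or emotion <= 10:  # stop on a decisive outcome
            break
    return evaluation
```

The loop yields both the numerical emotion trajectory and the per-turn inner thoughts, which is what makes the score interpretable rather than a single opaque rating.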
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM understanding of human social cognition
Measuring higher-order social cognition via emotional simulation
Assessing empathy gaps in commercial and open-source LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

SAGE framework evaluates LLM social cognition
Sentient Agent simulates human emotions and thoughts
Emotion trajectory and inner thoughts provide metrics
Authors
Bang Zhang (Hunyuan AI Digital Human, Tencent)
Ruotian Ma (Hunyuan AI Digital Human, Tencent)
Qingxuan Jiang (Graduate Student, MIT; Machine Learning, Optimization)
Peisong Wang (CASIA; Deep Neural Network Acceleration and Compression)
Jiaqi Chen (Hunyuan AI Digital Human, Tencent)
Zheng Xie (Hunyuan AI Digital Human, Tencent)
Xingyu Chen (Hunyuan AI Digital Human, Tencent)
Yue Wang (Hunyuan AI Digital Human, Tencent)
F. Ye (Hunyuan AI Digital Human, Tencent)
Jian Li (Hunyuan AI Digital Human, Tencent)
Yifan Yang (Hunyuan AI Digital Human, Tencent)
Zhaopeng Tu (Tech Lead, Tencent Digital Human; Digital Human, Agents, Large Language Models, Machine Translation)
Xiaolong Li (Hunyuan AI Digital Human, Tencent)