HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) lack rigorous, theory-driven, and clinically validated evaluation frameworks for assessing human-like intelligence—particularly in emotion understanding, cultural adaptation, and ethical reasoning—within Chinese contexts. To address this gap, we introduce HeartBench, the first Chinese benchmark for human-like intelligence evaluation. It features a five-dimensional, fifteen-level theoretical framework co-designed with clinical psychologists using authentic counseling scenarios. We propose the novel “reasoning-prior scoring” paradigm, transforming abstract human traits into quantifiable, fine-grained evaluation criteria. Additionally, we construct a difficulty-stratified Hard Set targeting emotionally metaphorical expressions and ethical dilemmas. Evaluations across 13 mainstream Chinese LLMs reveal that even the top-performing model achieves only 60% of expert-ideal scores; performance drops markedly on the Hard Set, demonstrating HeartBench’s high sensitivity to core bottlenecks in human-like intelligence and strong discriminative power.

📝 Abstract
While Large Language Models (LLMs) have achieved remarkable success in cognitive and reasoning benchmarks, they exhibit a persistent deficit in anthropomorphic intelligence: the capacity to navigate complex social, emotional, and ethical nuances. This gap is particularly acute in the Chinese linguistic and cultural context, where a lack of specialized evaluation frameworks and high-quality socio-emotional data impedes progress. To address these limitations, we present HeartBench, a framework designed to evaluate the integrated emotional, cultural, and ethical dimensions of Chinese LLMs. Grounded in authentic psychological counseling scenarios and developed in collaboration with clinical experts, the benchmark is structured around a theory-driven taxonomy comprising five primary dimensions and 15 secondary capabilities. We implement a case-specific, rubric-based methodology that translates abstract human-like traits into granular, measurable criteria through a "reasoning-before-scoring" evaluation protocol. Our assessment of 13 state-of-the-art LLMs indicates a substantial performance ceiling: even leading models achieve only 60% of the expert-defined ideal score. Furthermore, analysis using a difficulty-stratified "Hard Set" reveals a significant performance decay in scenarios involving subtle emotional subtexts and complex ethical trade-offs. HeartBench establishes a standardized metric for anthropomorphic AI evaluation and provides a methodological blueprint for constructing high-quality, human-aligned training data.
Problem

Research questions and friction points this paper is trying to address.

Evaluates emotional, cultural, and ethical intelligence in Chinese LLMs
Addresses the lack of specialized frameworks for anthropomorphic AI assessment
Measures performance decay in complex socio-emotional scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework for evaluating the emotional, cultural, and ethical dimensions of Chinese LLMs
Case-specific, rubric-based methodology with a reasoning-before-scoring protocol
Theory-driven taxonomy grounded in authentic psychological counseling scenarios
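The rubric-based, reasoning-before-scoring protocol can be sketched in code. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Criterion`, `judge_stub`, and `evaluate` names are hypothetical, and the stub judge stands in for an LLM judge that must articulate its reasoning before committing to a per-criterion score.

```python
# Hypothetical sketch of reasoning-before-scoring rubric evaluation.
# A real system would query an LLM judge; judge_stub is a toy stand-in.
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str        # fine-grained, case-specific criterion text
    max_points: int  # its contribution to the expert-defined ideal score


def judge_stub(response: str, criterion: Criterion) -> tuple[str, int]:
    """Stand-in judge: emits free-text reasoning *before* the score,
    mirroring the reasoning-prior scoring order."""
    reasoning = f"Checking whether the reply satisfies: {criterion.name}"
    # Toy decision rule: award full points if the criterion's first
    # keyword appears in the response, else zero.
    hit = criterion.name.split()[0] in response
    return reasoning, criterion.max_points if hit else 0


def evaluate(response: str, rubric: list[Criterion]) -> float:
    """Fraction of the expert-ideal score that the response earns."""
    earned, ideal = 0, 0
    for c in rubric:
        _reasoning, pts = judge_stub(response, c)
        earned += pts
        ideal += c.max_points
    return earned / ideal


rubric = [
    Criterion("empathy shown toward the speaker", 3),
    Criterion("cultural context respected", 2),
]
print(evaluate("empathy and cultural context both addressed", rubric))
```

The key design point the sketch mirrors is ordering: the judge's reasoning is produced before the numeric score, which in the paper's protocol is meant to ground the score in explicit criteria rather than a holistic first impression.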
Jiaxin Liu
Ant Group
Peiyi Tu
Ant Group
Wenyu Chen
Massachusetts Institute of Technology
Yihong Zhuang
Ant Group
Xinxia Ling
Ant Group, Xiamen University
Anji Zhou
Beijing Normal University
Chenxi Wang
Beijing Normal University
Zhuo Han
University of Massachusetts Amherst
Zhengkai Yang
Ant Group
Junbo Zhao
Ant Group, Zhejiang University
Zenan Huang
Ant Research
Yuanyuan Wang
Ant Group