HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

📅 2026-05-28

🤖 AI Summary

LLM agents are strong at task-oriented work, yet whether they can consistently simulate coherent human psychological traits has not been evaluated systematically. To address this gap, the authors combine Big Five personality theory with autobiographical episodic memory and construct a benchmark comprising 11 distinct personality profiles, 1,000 structured memory episodes, and 673 human-validated multiple-choice questions built on 64 decision-making scenarios guided by the DIAMONDS taxonomy. The benchmark tests whether an LLM agent's behavioral decisions align with its assigned personality traits and memories. By pairing personality psychology with structured memory representations, the work offers a scalable, systematic testbed for evaluating affective and behavioral consistency in LLM-based agents.

📝 Abstract

While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities in which emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human-like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we design a curated suite of 64 decision-making scenarios guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to these varied scenarios, the benchmark evaluates whether they can integrate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtain a benchmark consisting of 673 multiple-choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human-like emotions, personality consistency, and value-consistent behavioral decision-making in LLM-based agents.
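To make the evaluation setup concrete, the following is a minimal sketch of how a profile, an MCQ item, and a per-dimension accuracy score could be represented. It is an illustration only, not the authors' code or data format: the class names `Profile` and `MCQ`, the field names, the 0-to-1 trait scale, and the function `accuracy_by_dimension` are all assumptions. Only the eight DIAMONDS dimension names and the use of Big Five traits, episodic memories, and multiple-choice questions come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Sequence, Tuple


class Diamonds(Enum):
    """The eight DIAMONDS situation dimensions named in the abstract."""
    DUTY = "Duty"
    INTELLECT = "Intellect"
    ADVERSITY = "Adversity"
    MATING = "Mating"
    POSITIVITY = "pOsitivity"
    NEGATIVITY = "Negativity"
    DECEPTION = "Deception"
    SOCIALITY = "Sociality"


@dataclass
class Profile:
    """Hypothetical character record: Big Five scores plus episodic memories."""
    name: str
    big_five: Dict[str, float]   # assumed 0..1 scale, e.g. {"openness": 0.9}
    memories: Tuple[str, ...]    # autobiographical episodes as plain text


@dataclass(frozen=True)
class MCQ:
    """Hypothetical multiple-choice item tied to one profile and one dimension."""
    profile_name: str
    dimension: Diamonds
    scenario: str
    options: Tuple[str, ...]
    answer_index: int            # index of the profile-consistent option


def accuracy_by_dimension(
    questions: Sequence[MCQ], predictions: Sequence[int]
) -> Dict[Diamonds, float]:
    """Return the fraction of profile-consistent choices per DIAMONDS dimension."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    correct: Dict[Diamonds, int] = defaultdict(int)
    total: Dict[Diamonds, int] = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        total[question.dimension] += 1
        if predicted == question.answer_index:
            correct[question.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}
```

In this sketch, `predictions` would hold the option index an agent selects after being prompted with a profile's traits and memories. Grouping accuracy by dimension is one plausible way to see where an agent's choices drift from its assigned profile. The abstract does not specify the scoring protocol.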

Problem

Research questions and friction points this paper is trying to address.

LLM agents

human-like psychology

personality consistency

emotional dimensions

behavioral decision-making

Innovation

Methods, ideas, or system contributions that make the work stand out.

HEART-Bench

Big Five personality

autobiographical memory