🤖 AI Summary
This study investigates whether large language model (LLM) agents can serve as valid substitutes for human subjects in behavioral research—specifically, whether their behavior exhibits internal consistency across experimental contexts, defined as coherent alignment between latent psychological states and overt responses.
Method: We propose a social-simulation–based evaluation framework integrating latent-variable modeling with structured dialogue experiments, enabling systematic hypothesis testing across diverse LLM families and parameter scales.
Contribution/Results: Despite surface-level human-like responsiveness, LLMs exhibit pervasive internal inconsistency: identical models generate contradictory beliefs and behaviors across contexts, failing to sustain a stable correspondence between latent psychological states and overt behavior. This work is the first to identify internal inconsistency as a fundamental limitation of LLM agents for human behavioral modeling, and it introduces a scalable, empirically grounded paradigm for assessing behavioral consistency in AI agents.
📝 Abstract
The impressive capabilities of Large Language Models (LLMs) have fueled the notion that synthetic agents can serve as substitutes for real participants in human-subject research. In an effort to evaluate the merits of this claim, social science researchers have largely focused on whether LLM-generated survey data corresponds to that of a human counterpart whom the LLM is prompted to represent. In contrast, we address a more fundamental question: Do agents maintain internal consistency, behaving similarly when examined under different experimental settings? To this end, we develop a study designed to (a) reveal the agent's internal state and (b) examine agent behavior in a basic dialogue setting. This design enables us to explore a set of behavioral hypotheses and assess whether an agent's conversational behavior is consistent with what we would expect from its revealed internal state. Our findings on these hypotheses show significant internal inconsistencies in LLMs across model families and model sizes. Most importantly, we find that, although agents may generate responses matching those of their human counterparts, they fail to be internally consistent, a critical gap in their ability to accurately substitute for real participants in human-subject research. Our simulation code and data are publicly accessible.
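To make the consistency-checking idea concrete, the sketch below illustrates its general structure in Python. It is not the authors' code: the prompts, the `ask` interface, the single Likert item, and the keyword-based stance scorer are all simplifying assumptions made for illustration; a real study would use validated instruments and a proper latent-variable model.

```python
# Illustrative sketch only: elicit a latent attitude, observe dialogue behavior
# on the same topic, and flag disagreement between the two. All prompts, the
# `ask` interface, and the crude scoring rule are assumptions, not the paper's method.

from typing import Callable


def elicit_internal_state(ask: Callable[[str], str], topic: str) -> int:
    """Reveal the agent's latent attitude with a single Likert-style survey item."""
    reply = ask(
        f"On a scale from 1 (strongly oppose) to 5 (strongly support), "
        f"how do you feel about {topic}? Answer with a single number."
    )
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 3  # fall back to neutral if unparseable


def observe_dialogue_stance(ask: Callable[[str], str], topic: str) -> int:
    """Probe the same agent in a basic dialogue setting and crudely score its stance."""
    reply = ask(f"A friend asks what you think about {topic}. Reply in one or two sentences.")
    text = reply.lower()
    if any(w in text for w in ("support", "agree", "favor", "good idea")):
        return 5
    if any(w in text for w in ("oppose", "disagree", "against", "bad idea")):
        return 1
    return 3


def is_internally_consistent(ask: Callable[[str], str], topic: str, tolerance: int = 1) -> bool:
    """Treat the agent as consistent when its stated attitude and its dialogue
    behavior fall within `tolerance` points of each other on the same scale."""
    return abs(elicit_internal_state(ask, topic) - observe_dialogue_stance(ask, topic)) <= tolerance


if __name__ == "__main__":
    # Stand-in for a real LLM call; swap in any chat-completion client here.
    def mock_agent(prompt: str) -> str:
        return "5" if "scale" in prompt else "Honestly, I am against it."

    print(is_internally_consistent(mock_agent, "a four-day work week"))  # -> False
```

The point of the sketch is the shape of the test, eliciting a latent state and observing behavior on the same topic in a separate context, rather than the crude scoring itself; the paper's hypotheses compare these two kinds of measurements at scale across model families and sizes.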