🤖 AI Summary
Behavioral science has long lacked reliable participant simulators capable of accurately modeling human cognitive behavior. This paper systematically evaluates the large language model Centaur, fine-tuned by Binz et al. on data from 160 human experiments, as a candidate simulator, proposing and validating three core criteria for participant simulation: behavioral fidelity, task generalizability, and mechanistic interpretability. The evaluation combines cognitive modeling with generative behavioral analysis across multiple dimensions. Results indicate that although Centaur achieves strong predictive accuracy, its generated response patterns show systematic biases and deviate significantly from empirical human data, failing to meet the reliability requirements for valid simulation. To our knowledge, this work establishes the first operational evaluation framework for LLM-based participant simulators. It further identifies critical future directions: integrating explicit cognitive architectures with causal constraints to advance automated prototyping in experimental psychology.
📝 Abstract
Simulators have revolutionized scientific practice across the natural sciences. By generating data that reliably approximate real-world phenomena, they enable scientists to accelerate hypothesis testing and optimize experimental designs. This is perhaps best illustrated by AlphaFold, a Nobel Prize-winning simulator in chemistry that predicts protein structures from amino acid sequences, enabling rapid prototyping of molecular interactions, drug targets, and protein functions. In the behavioral sciences, a reliable participant simulator - a system capable of producing human-like behavior across cognitive tasks - would represent a similarly transformative advance. Recently, Binz et al. introduced Centaur, a large language model (LLM) fine-tuned on human data from 160 experiments, proposing its use not only as a model of cognition but also as a participant simulator for "in silico prototyping of experimental studies", e.g., to advance automated cognitive science. Here, we review the core criteria for a participant simulator and assess how well Centaur meets them. Although Centaur demonstrates strong predictive accuracy, its generative behavior - a critical criterion for a participant simulator - systematically diverges from human data. This suggests that, while Centaur is a significant step toward predicting human behavior, it does not yet meet the standards of a reliable participant simulator or an accurate model of cognition.
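The abstract's key distinction - a model can score well on per-trial prediction yet produce generated behavior whose distribution diverges from human data - can be illustrated with a toy sketch. All numbers below are hypothetical and not taken from the paper; the "simulator" is a stand-in that over-commits to the majority choice:

```python
import random

random.seed(0)

# Hypothetical human data: probability matching on a 70/30 bandit,
# i.e. humans pick the better arm on ~70% of trials.
human_choices = [1 if random.random() < 0.7 else 0 for _ in range(1000)]

# A degenerate "simulator" that always predicts the better arm. It looks
# respectable on one-step predictive accuracy (it matches the majority) ...
def predict(history):
    return 1

pred_acc = sum(predict(human_choices[:i]) == c
               for i, c in enumerate(human_choices)) / len(human_choices)

# ... but when run generatively it chooses the better arm 90% of the
# time, a systematic bias relative to the human choice distribution.
sim_choices = [1 if random.random() < 0.9 else 0 for _ in range(1000)]

human_rate = sum(human_choices) / len(human_choices)
sim_rate = sum(sim_choices) / len(sim_choices)
print(f"predictive accuracy:   {pred_acc:.2f}")   # ~0.70
print(f"human choice rate:     {human_rate:.2f}") # ~0.70
print(f"simulated choice rate: {sim_rate:.2f}")   # ~0.90, biased
```

The point of the sketch is that predictive accuracy alone cannot certify a participant simulator: generative fidelity has to be checked against the empirical distribution of human behavior, which is the evaluation the paper performs on Centaur.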