🤖 AI Summary
Behavioral science has long lacked reliable participant simulators capable of accurately modeling human cognitive behavior. This paper systematically evaluates the large language model Centaur, fine-tuned by Binz et al. on data from 160 human experiments, as a candidate simulator, proposing and validating three core criteria for participant simulation: behavioral fidelity, task generalizability, and mechanistic interpretability. The evaluation combines cognitive modeling with generative behavioral analysis across multiple dimensions. Results indicate that although Centaur achieves strong predictive accuracy, its generated response patterns show systematic biases and deviate significantly from empirical human data, failing to meet the reliability requirements for valid simulation. To our knowledge, this work establishes the first operational evaluation framework for LLM-based participant simulators. It further identifies critical future directions: integrating explicit cognitive architectures with causal constraints to advance automated prototyping in experimental psychology.
📝 Abstract
Simulators have revolutionized scientific practice across the natural sciences. By generating data that reliably approximate real-world phenomena, they enable scientists to accelerate hypothesis testing and optimize experimental designs. This is perhaps best illustrated by AlphaFold, a Nobel Prize-winning simulator in chemistry that predicts protein structures from amino acid sequences, enabling rapid prototyping of molecular interactions, drug targets, and protein functions. In the behavioral sciences, a reliable participant simulator - a system capable of producing human-like behavior across cognitive tasks - would represent a similarly transformative advance. Recently, Binz et al. introduced Centaur, a large language model (LLM) fine-tuned on human data from 160 experiments, proposing its use not only as a model of cognition but also as a participant simulator for "in silico prototyping of experimental studies", e.g., to advance automated cognitive science. Here, we review the core criteria for a participant simulator and assess how well Centaur meets them. Although Centaur demonstrates strong predictive accuracy, its generative behavior - a critical criterion for a participant simulator - systematically diverges from human data. This suggests that, while Centaur is a significant step toward predicting human behavior, it does not yet meet the standards of a reliable participant simulator or an accurate model of cognition.
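The abstract's key distinction - a model can score well on per-trial prediction yet produce generated behavior whose distribution diverges from human data - can be illustrated with a toy sketch. All numbers below are hypothetical and not taken from the paper; the "simulator" is a stand-in that over-commits to the majority choice:

```python
import random

random.seed(0)

# Hypothetical human data: probability matching on a 70/30 bandit,
# i.e. humans pick the better arm on ~70% of trials.
human_choices = [1 if random.random() < 0.7 else 0 for _ in range(1000)]

# A degenerate "simulator" that always predicts the better arm. It looks
# respectable on one-step predictive accuracy (it matches the majority) ...
def predict(history):
    return 1

pred_acc = sum(predict(human_choices[:i]) == c
               for i, c in enumerate(human_choices)) / len(human_choices)

# ... but when run generatively it chooses the better arm 90% of the
# time, a systematic bias relative to the human choice distribution.
sim_choices = [1 if random.random() < 0.9 else 0 for _ in range(1000)]

human_rate = sum(human_choices) / len(human_choices)
sim_rate = sum(sim_choices) / len(sim_choices)
print(f"predictive accuracy:   {pred_acc:.2f}")   # ~0.70
print(f"human choice rate:     {human_rate:.2f}") # ~0.70
print(f"simulated choice rate: {sim_rate:.2f}")   # ~0.90, biased
```

The point of the sketch is that predictive accuracy alone cannot certify a participant simulator: generative fidelity has to be checked against the empirical distribution of human behavior, which is the evaluation the paper performs on Centaur.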