🤖 AI Summary
This study critically examines the validity of using large language models (LLMs) as substitutes for human participants in psychological research. Method: We systematically compare responses from multiple LLMs—including CENTAUR—with empirical human data on standardized psychological tasks, assessing semantic sensitivity and cross-model consistency. Results: Minor prompt perturbations induce substantial deviations from human behavioral patterns; inter-model response variability remains high even after fine-tuning on psychological tasks; and no model achieves stable, robust alignment with human data. This constitutes an empirical demonstration that LLMs lack internal consistency and ecological validity in psychological simulation, thereby challenging the foundational assumption that they can serve as reliable proxies for human cognition and behavior. Contribution: The study establishes “human-data validation” as a methodological necessity and provides a rigorous caution against uncritical adoption of LLMs in social science research, offering concrete guidance for their principled, empirically grounded application.
📝 Abstract
Large Language Models (LLMs), such as ChatGPT, are increasingly used in research, ranging from simple writing assistance to complex data annotation tasks. Recently, some research has suggested that LLMs may even be able to simulate human psychology and can, hence, replace human participants in psychological studies. We caution against this approach. We provide conceptual arguments against the hypothesis that LLMs simulate human psychology. We then present empirical evidence illustrating our arguments by demonstrating that slight changes to wording that correspond to large changes in meaning lead to notable discrepancies between LLMs' and human responses, even for the recent CENTAUR model that was specifically fine-tuned on psychological responses. Additionally, different LLMs show very different responses to novel items, further illustrating their lack of reliability. We conclude that LLMs do not simulate human psychology and recommend that psychological researchers treat LLMs as useful but fundamentally unreliable tools that need to be validated against human responses for every new application.
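The per-application validation the abstract recommends could, in its simplest form, look something like the sketch below: collect human responses for both the original and the reworded items, query the model on the same items, and check whether LLM-human alignment survives the rewording. This is a minimal illustrative sketch, not the authors' analysis pipeline; the ratings, item lists, and the `get_llm_rating` wrapper are hypothetical placeholders.

```python
# Minimal sketch: comparing LLM ratings with human ratings on original vs. reworded items.
# All data below are hypothetical placeholders; get_llm_rating is an assumed stub for
# whichever model API is being evaluated.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical mean human ratings (e.g., 1-7 Likert) for five items,
# once for the original wording and once for a meaning-changing rewording.
human_original = np.array([5.8, 3.1, 4.4, 6.2, 2.7])
human_reworded = np.array([2.3, 5.6, 4.1, 1.9, 5.2])

def get_llm_rating(item_text: str) -> float:
    """Stub: replace with a call to the LLM under evaluation, parsing a numeric rating."""
    return float(np.random.uniform(1, 7))

items_original = ["..."] * 5   # original item wordings (placeholders)
items_reworded = ["..."] * 5   # reworded items with changed meaning (placeholders)

llm_original = np.array([get_llm_rating(t) for t in items_original])
llm_reworded = np.array([get_llm_rating(t) for t in items_reworded])

# If the model tracked meaning rather than surface form, alignment with humans
# should hold for both wordings, not just the original one.
r_orig, _ = pearsonr(llm_original, human_original)
r_rew, _ = pearsonr(llm_reworded, human_reworded)
print(f"LLM-human correlation (original wording): {r_orig:.2f}")
print(f"LLM-human correlation (reworded items):  {r_rew:.2f}")
```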