Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs

📅 2024-08-29

🏛️ arXiv.org

📈 Citations: 11

✨ Influential: 0


🤖 AI Summary

This study systematically evaluates whether GPT-4 can serve as a computational proxy for human participants in replicating findings from scenario-based experiments in psychology and management science. Using zero-shot prompting, the authors ran large-scale replication attempts across 154 empirical studies, covering 618 main effects and 138 interaction effects, and applied a metascientific framework with four statistical consistency checks: directional agreement, significance concordance, 95% confidence interval (CI) coverage, and the rate of unexpected significant results. The summary describes this as, to the authors' knowledge, the first study to quantify LLM behavioral fidelity across more than 100 real-world experiments. GPT-4 replicated 76.0% of main effects and 47.0% of interaction effects. However, only 19.44% of GPT-4's replicated 95% CIs contain the original effect size, with most replicated effect sizes exceeding the original studies' 95% CIs, and 71.6% of the results originally reported as null came out significant in the GPT-4 replication. The pattern is one of high agreement in direction and significance but low effect-size fidelity. These findings provide empirical benchmarks for the validity and limits of LLMs as computational subjects.
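The four consistency checks named above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the `Effect` record, the 0.05 significance threshold, the function names, and the choice of denominators (rates over originally significant effects, except for unexpected significance) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

ALPHA = 0.05  # assumed significance threshold; the summary does not state one


@dataclass
class Effect:
    """One reported effect (hypothetical record layout)."""
    estimate: float  # effect size; its sign encodes the direction
    ci_low: float    # lower bound of the 95% CI
    ci_high: float   # upper bound of the 95% CI
    p_value: float


def is_significant(e: Effect) -> bool:
    return e.p_value < ALPHA


def same_direction(orig: Effect, rep: Effect) -> bool:
    # Directional agreement: both estimates share the same nonzero sign.
    return orig.estimate * rep.estimate > 0


def replicated(orig: Effect, rep: Effect) -> bool:
    # Assumed criterion: both significant and pointing the same way.
    return is_significant(orig) and is_significant(rep) and same_direction(orig, rep)


def ci_covers_original(orig: Effect, rep: Effect) -> bool:
    # CI coverage as worded in the abstract: the replication's CI
    # contains the original point estimate.
    return rep.ci_low <= orig.estimate <= rep.ci_high


def _rate(flags: List[bool]) -> float:
    return sum(flags) / len(flags) if flags else float("nan")


def summarize(pairs: List[Tuple[Effect, Effect]]) -> Dict[str, float]:
    """Aggregate per-effect checks over (original, replication) pairs."""
    sig = [(o, r) for o, r in pairs if is_significant(o)]
    null = [(o, r) for o, r in pairs if not is_significant(o)]
    return {
        "direction_agreement": _rate([same_direction(o, r) for o, r in sig]),
        "replication_rate": _rate([replicated(o, r) for o, r in sig]),
        "ci_coverage": _rate([ci_covers_original(o, r) for o, r in sig]),
        # Share of originally null effects that the replication finds significant.
        "unexpected_significance": _rate([is_significant(r) for _, r in null]),
    }
```

Calling `summarize` on a list of `(original, replication)` pairs returns one rate per check. A replication can count as successful on direction and significance while still failing CI coverage, which is the gap the reported 76.0% and 19.44% figures describe.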


📝 Abstract

Artificial Intelligence (AI) is increasingly being integrated into scientific research, particularly in the social sciences, where understanding human behavior is critical. Large Language Models (LLMs) like GPT-4 have shown promise in replicating human-like responses in various psychological experiments. However, the extent to which LLMs can effectively replace human subjects across diverse experimental contexts remains unclear. Here, we conduct a large-scale study replicating 154 psychological experiments from top social science journals with 618 main effects and 138 interaction effects using GPT-4 as a simulated participant. We find that GPT-4 successfully replicates 76.0 percent of main effects and 47.0 percent of interaction effects observed in the original studies, closely mirroring human responses in both direction and significance. However, only 19.44 percent of GPT-4's replicated confidence intervals contain the original effect sizes, with the majority of replicated effect sizes exceeding the 95 percent confidence interval of the original studies. Additionally, there is a 71.6 percent rate of unexpected significant results where the original studies reported null findings, suggesting potential overestimation or false positives. Our results demonstrate the potential of LLMs as powerful tools in psychological research but also emphasize the need for caution in interpreting AI-driven findings. While LLMs can complement human studies, they cannot yet fully replace the nuanced insights provided by human subjects.

Problem

Research questions and friction points this paper is trying to address.

Assessing whether LLMs can replace human subjects in psychology experiments

Comparing LLM and human effect sizes in replicated studies

Evaluating LLM performance on socially sensitive research topics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Replicated 154 experiments using GPT-4 as a simulated participant

Compared LLM and human effect sizes

Assessed LLM performance on sensitive topics

🔎 Similar Papers

Using Large Language Models to Create AI Personas for Replication and Prediction of Media Effects: An Empirical Test of 133 Published Experimental Research Findings