🤖 AI Summary
This study systematically evaluates how well large language models (LLMs) simulate human decision-making in operations management, focusing on two criteria: reproducing well-established behavioral effects and matching human response distributions. Using data from nine published behavioral experiments, we quantify the divergence between LLM outputs and human responses with the Wasserstein distance, and we introduce two lightweight interventions, chain-of-thought prompting and hyperparameter tuning, to improve alignment. Results show that while mainstream LLMs replicate most canonical decision biases, their response distributions deviate systematically from the human data; after intervention, distributional alignment improves significantly, and certain small open-weight models even outperform large commercial models. This work is the first to adopt distributional alignment as a core evaluation metric, and it proposes a low-cost, transferable calibration framework, providing both methodological foundations and empirical evidence for the trustworthy deployment of LLMs in behavioral modeling and operational decision support.
📝 Abstract
LLMs are emerging tools for simulating human behavior in business, economics, and social science, offering a lower-cost complement to laboratory experiments, field studies, and surveys. This paper evaluates how well LLMs replicate human behavior in operations management. Using nine published experiments in behavioral operations, we assess two criteria: replication of hypothesis-test outcomes and distributional alignment via the Wasserstein distance. LLMs reproduce most hypothesis-level effects, capturing key decision biases, but their response distributions diverge from human data, even for strong commercial models. We also test two lightweight interventions -- chain-of-thought prompting and hyperparameter tuning -- which reduce misalignment and can sometimes let smaller or open-source models match or surpass larger systems.
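To make the alignment metric concrete: for one-dimensional responses, the empirical Wasserstein (earth mover's) distance between two equal-sized samples reduces to the mean absolute difference of their sorted values. A minimal sketch is below; the `wasserstein_1d` helper and the newsvendor-style order quantities are illustrative assumptions, not data or code from the paper.

```python
def wasserstein_1d(human, llm):
    """Empirical 1-D Wasserstein distance between two equal-sized samples:
    the mean absolute difference of the sorted values (a special case of
    the general metric, assumed here for simplicity)."""
    assert len(human) == len(llm), "this sketch assumes equal sample sizes"
    return sum(abs(h - m) for h, m in zip(sorted(human), sorted(llm))) / len(human)

# Hypothetical order quantities (illustrative numbers only): human subjects
# show wide dispersion, while the simulated LLM responses cluster tightly.
human_orders = [40, 55, 60, 62, 70, 75]
llm_orders = [58, 59, 60, 60, 61, 62]

print(round(wasserstein_1d(human_orders, llm_orders), 2))  # ≈ 7.67
```

A small distance indicates the model not only gets the average effect right but also spreads its responses the way human subjects do; in practice one would use `scipy.stats.wasserstein_distance`, which also handles unequal sample sizes and weights.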