Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) struggle to capture the long-term, cross-context, and heterogeneous complexity of real-world human behavior, and existing benchmarks rely on isolated scenarios, constrained action spaces, or synthetic data. This work proposes OmniBehavior, which the authors describe as the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral traces to systematically evaluate LLMs’ capacity for behavioral simulation. Using the benchmark, the authors first show that datasets confined to isolated scenarios suffer from “tunnel vision,” whereas real-world decisions depend on long-term, cross-scenario causal chains. Evaluations of state-of-the-art LLMs then show that performance plateaus even as context windows expand, and that simulated behavior converges toward a positive “average person,” exhibiting hyper-activity, persona homogenization, and a Utopian bias, at the cost of individual differences and long-tail behaviors.
📝 Abstract
The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.
Problem

Research questions and friction points this paper is trying to address.

human behavior simulation
large language models
real-world data
long-horizon behavior
cross-scenario behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniBehavior
long-horizon behavior
cross-scenario simulation
heterogeneous behavior traces
structural bias in LLMs
Jiawei Chen
Institute of Software, Chinese Academy of Sciences
Large Language Models
Ruoxi Xu
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Boxi Cao
Institute of Software, Chinese Academy of Sciences
Natural Language Processing
Ruotong Pan
Kuaishou Technology
Yunfei Zhang
Kuaishou Technology
Yifei Hu
Kuaishou Technology
Yong Du
Kuaishou Technology
Tingting Gao
Kuaishou Technology
Yaojie Lu
Institute of Software, Chinese Academy of Sciences
Information Extraction, Large Language Models
Yingfei Sun
University of Chinese Academy of Sciences
Xianpei Han
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
Le Sun
Institute of Software, CAS
Information Retrieval, Natural Language Processing
Xiangyu Wu
Kuaishou Technology
Hongyu Lin
Institute of Software, Chinese Academy of Sciences
Natural Language Processing, Information Extraction and Machine Learning