🤖 AI Summary
This work addresses a limitation of large language models (LLMs) in web action generation: they prioritize subjective plausibility over objective behavioral accuracy. To close this gap, it introduces a behavior modeling paradigm grounded in real-world online shopping interaction data. Methodologically, it (1) establishes the first quantitative benchmark for evaluating web interaction behaviors, and (2) proposes a dual-path training framework that combines fine-tuning on real behavioral data with synthetic reasoning-trajectory augmentation, integrating behavioral sequence modeling with explicit stepwise reasoning injection. Experiments on DeepSeek-R1, Llama, and Claude show that, compared to prompt-engineering-only baselines, the approach achieves significant gains in action prediction accuracy on real-world action datasets, and that explicit reasoning further improves behavioral fidelity. The result is a quantifiable, reproducible methodology for high-fidelity human behavior simulation and LLM-based agent development.
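To make the "reasoning injection" idea concrete, here is a minimal sketch of how an observed shopping session plus a synthetic reasoning trace might be serialized into a supervised fine-tuning record. The record layout, the `<think>` delimiter, and all action names (`search`, `click`, `add_to_cart`, etc.) are illustrative assumptions, not the paper's actual data format:

```python
import json

def build_sft_record(context_actions, reasoning, next_action):
    """Serialize one observed session into a fine-tuning record:
    the action history becomes the prompt, and a synthetic reasoning
    trace is prepended to the target so the model learns to emit an
    explicit rationale before predicting the next action."""
    prompt = "Session so far:\n" + "\n".join(
        f"{i + 1}. {a}" for i, a in enumerate(context_actions)
    )
    # Hypothetical convention: wrap the rationale in <think> tags,
    # then state the predicted action on its own line.
    completion = f"<think>{reasoning}</think>\nNext action: {next_action}"
    return {"prompt": prompt, "completion": completion}

# Hypothetical session: the user searches, opens an item, reads reviews.
record = build_sft_record(
    context_actions=[
        "search('wireless earbuds')",
        "click(item_id=1042)",
        "read_reviews(item_id=1042)",
    ],
    reasoning=(
        "The user inspected reviews after opening the item, "
        "which suggests purchase intent rather than continued browsing."
    ),
    next_action="add_to_cart(item_id=1042)",
)
print(json.dumps(record, indent=2))
```

Records of this shape could be fed to any standard instruction-tuning pipeline; training with versus without the `<think>` span would correspond to the paper's comparison of reasoning-augmented versus plain behavioral fine-tuning.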
📝 Abstract
Recent research shows that LLMs can simulate "believable" human behaviors to power LLM agents via prompt-only methods. In this work, we focus on evaluating and improving LLMs' objective "accuracy" rather than their subjective "believability" in the web action generation task, leveraging a large-scale, real-world dataset of human actions collected from online shopping. We present the first comprehensive quantitative evaluation of state-of-the-art LLMs (e.g., DeepSeek-R1, Llama, and Claude) on the task of web action generation. Our results show that fine-tuning LLMs on real-world behavioral data substantially improves their ability to generate actions compared to prompt-only methods. Furthermore, incorporating synthesized reasoning traces into model training leads to additional performance gains, demonstrating the value of explicit rationale in behavior modeling. This work establishes a new benchmark for evaluating LLMs in behavior simulation and offers actionable insights into how real-world action data and reasoning augmentation can enhance the fidelity of LLM agents.