OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating large language models (LLMs) in simulating user online shopping behavior—particularly next-action prediction—is hindered by a scarcity of high-quality, fine-grained, cognition-aware behavioral data. Method: We introduce OPeRA, the first publicly available shopping behavior simulation dataset featuring quadruple-aligned annotations: user personas, visual observations, action sequences, and intrinsic rationales. We design a digital-twin-oriented simulation benchmark, collecting real-world interactions via a custom browser extension and capturing cognitive processes through just-in-time structured questionnaires. Multimodal temporal alignment and privacy-preserving annotation yield hundreds of complete shopping sessions. Contribution/Results: Joint prediction of actions and rationales significantly improves performance: baseline models achieve 17.3% higher accuracy than single-task counterparts, empirically validating LLMs’ capacity to model users’ dynamic intentions in e-commerce contexts.
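To make the quadruple-aligned structure concrete, here is a minimal sketch of what one session record and a next-action-prediction example might look like. All field names (`observation`, `rationale`, `next_action_example`, etc.) are illustrative assumptions, not OPeRA's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ShoppingStep:
    """One aligned step: what the user saw, what they did, and why.
    Field names are hypothetical, for illustration only."""
    observation: str  # e.g., the page state or URL the user sees
    action: str       # fine-grained web action, e.g., "click", "type"
    target: str       # the element the action applies to
    rationale: str    # self-reported just-in-time reason for the action

@dataclass
class ShoppingSession:
    persona: dict  # demographic / preference attributes of the user
    steps: list[ShoppingStep] = field(default_factory=list)

    def next_action_example(self, history_len: int):
        """Split a session into (history, next step) -- the shape a
        next-action-prediction benchmark would consume."""
        return self.steps[:history_len], self.steps[history_len]

# A toy two-step session (contents invented for illustration).
session = ShoppingSession(
    persona={"age_range": "25-34", "shopping_frequency": "weekly"},
    steps=[
        ShoppingStep("search results for 'running shoes'", "click",
                     "product card #3", "the price fit my budget"),
        ShoppingStep("product detail page", "click",
                     "add-to-cart button", "reviews looked trustworthy"),
    ],
)
history, nxt = session.next_action_example(1)
```

Under this framing, joint prediction means the model must output both `nxt.action` and `nxt.rationale` given the persona and `history`, rather than the action alone.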

📝 Abstract
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating "believable" human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPeRA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPeRA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPeRA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale given a persona and interaction history. This dataset lays the groundwork for future research into LLM agents that act as personalized digital twins for humans.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs' ability to simulate a specific user's next web action
Address lack of high-quality datasets capturing user actions and reasoning
Establish benchmark for LLMs predicting actions and rationales
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the OPeRA dataset for LLM evaluation
Uses custom browser plugin for data collection
Establishes benchmark for personalized action prediction