OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating large language models (LLMs) in simulating user online shopping behavior—particularly next-action prediction—is hindered by a scarcity of high-quality, fine-grained, cognition-aware behavioral data. Method: We introduce OPeRA, the first publicly available shopping behavior simulation dataset featuring quadruple-aligned annotations: user personas, visual observations, action sequences, and intrinsic rationales. We design a digital-twin-oriented simulation benchmark, collecting real-world interactions via a custom browser extension and capturing cognitive processes through just-in-time structured questionnaires. Multimodal temporal alignment and privacy-preserving annotation yield hundreds of complete shopping sessions. Contribution/Results: Joint prediction of actions and rationales significantly improves performance: baseline models achieve 17.3% higher accuracy than single-task counterparts, empirically validating LLMs’ capacity to model users’ dynamic intentions in e-commerce contexts.
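To make the quadruple-aligned structure concrete, here is a minimal sketch of what one session record and a next-action-prediction example might look like. All field names (`observation`, `rationale`, `next_action_example`, etc.) are illustrative assumptions, not OPeRA's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ShoppingStep:
    """One aligned step: what the user saw, what they did, and why.
    Field names are hypothetical, for illustration only."""
    observation: str  # e.g., the page state or URL the user sees
    action: str       # fine-grained web action, e.g., "click", "type"
    target: str       # the element the action applies to
    rationale: str    # self-reported just-in-time reason for the action

@dataclass
class ShoppingSession:
    persona: dict  # demographic / preference attributes of the user
    steps: list[ShoppingStep] = field(default_factory=list)

    def next_action_example(self, history_len: int):
        """Split a session into (history, next step) -- the shape a
        next-action-prediction benchmark would consume."""
        return self.steps[:history_len], self.steps[history_len]

# A toy two-step session (contents invented for illustration).
session = ShoppingSession(
    persona={"age_range": "25-34", "shopping_frequency": "weekly"},
    steps=[
        ShoppingStep("search results for 'running shoes'", "click",
                     "product card #3", "the price fit my budget"),
        ShoppingStep("product detail page", "click",
                     "add-to-cart button", "reviews looked trustworthy"),
    ],
)
history, nxt = session.next_action_example(1)
```

Under this framing, joint prediction means the model must output both `nxt.action` and `nxt.rationale` given the persona and `history`, rather than the action alone.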

📝 Abstract
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating "believable" human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPeRA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPeRA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPeRA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale given a persona and interaction history. This dataset lays the groundwork for future research into LLM agents that act as personalized digital twins for humans.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs' ability to simulate a specific user's next web action
Address lack of high-quality datasets capturing user actions and reasoning
Establish benchmark for LLMs predicting actions and rationales
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the OPeRA dataset for LLM evaluation
Uses custom browser plugin for data collection
Establishes benchmark for personalized action prediction