🤖 AI Summary
Existing approaches struggle to model the dynamic evolution of e-commerce users’ intents across multiple browsing sessions, primarily due to overreliance on shallow textual features (e.g., titles and descriptions) and the absence of annotated data and evaluation benchmarks for cross-session intent transfer.
Method: We propose the “Intent Tree”—a novel hierarchical structure that explicitly models intent evolution across sessions—and introduce SessionIntentBench, a large-scale, multimodal, multi-task benchmark comprising 1.97 million intent annotations and over 10 million derivable tasks.
Contribution/Results: Experiments reveal that current Large Vision-Language Models (LVLMs) perform poorly on cross-session intent shift tasks, while integrating Intent Tree representations significantly improves their performance. This work establishes a new paradigm, dataset, and evaluation standard for understanding e-commerce user behavior, advancing research in cross-session intent modeling and multimodal session understanding.
📝 Abstract
Session history is a common way of recording user interaction behaviors throughout a browsing activity involving multiple products. For example, if a user clicks a product webpage and then leaves, it may be because certain features fail to satisfy the user, which serves as an important indicator of on-the-spot user preferences. However, prior works fail to capture and model customer intention effectively because they exploit this information insufficiently, relying only on surface features such as descriptions and titles. There is also a lack of data and corresponding benchmarks for explicitly modeling intention in e-commerce product purchase sessions. To address these issues, we introduce the concept of an intention tree and propose a dataset curation pipeline. Together, we construct a multimodal benchmark, SessionIntentBench, that evaluates L(V)LMs' capability to understand inter-session intention shift through four subtasks. With 1,952,177 intention entries, 1,132,145 session intention trajectories, and 13,003,664 available tasks mined from 10,905 sessions, we provide a scalable way to exploit existing session data for customer intention understanding. We conduct human annotation to collect ground-truth labels for a subset of the collected data, forming an evaluation gold set. Extensive experiments on the annotated data further confirm that current L(V)LMs fail to capture and utilize intention across complex session settings. Further analysis shows that injecting intention information enhances LLMs' performance.
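To make the "intention tree" idea concrete, here is a minimal sketch of what such a hierarchical structure might look like, where root-to-leaf paths correspond to session intention trajectories. All class and method names here (`IntentNode`, `trajectories`, the example intents) are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of an intent tree: each node holds one intent
# hypothesis, and root-to-leaf paths form candidate intent trajectories.
from dataclasses import dataclass, field
from typing import List


@dataclass
class IntentNode:
    """One intent hypothesis (e.g. 'prefers wireless')."""
    intent: str
    children: List["IntentNode"] = field(default_factory=list)

    def add_child(self, intent: str) -> "IntentNode":
        child = IntentNode(intent)
        self.children.append(child)
        return child

    def trajectories(self) -> List[List[str]]:
        """Enumerate all root-to-leaf intent paths."""
        if not self.children:
            return [[self.intent]]
        return [[self.intent] + path
                for child in self.children
                for path in child.trajectories()]


# Example: a user browses headphones and refines toward wireless over-ear.
root = IntentNode("buy headphones")
wireless = root.add_child("prefers wireless")
wireless.add_child("needs long battery life")
root.add_child("prefers wired")
print(root.trajectories())
```

Under this sketch, each refinement across sessions appends a deeper node, so an intent shift shows up as branching to a different subtree rather than extending the current path.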