To See or To Read: User Behavior Reasoning in Multimodal LLMs

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the relative effectiveness of visual versus textual modalities in multimodal large language models (MLLMs) for user behavior reasoning. Addressing the lack of systematic modality comparison in prior work, we propose BehaviorLens—a framework that conducts the first comprehensive evaluation of six MLLMs on real-world purchase sequence data, assessing their modeling capabilities across three behavioral representations: textual paragraphs, scatter plots, and flowcharts. Results demonstrate that visual representations—particularly scatter plots—substantially outperform textual descriptions, improving next-purchase prediction accuracy by 87.5% without increasing computational overhead. This finding reveals the unique advantage of visual encoding in behavioral sequence modeling and provides empirical grounding and a new paradigm for modality selection in MLLMs targeting user behavior understanding.

📝 Abstract
Multimodal Large Language Models (MLLMs) are reshaping how modern agentic systems reason over sequential user-behavior data. However, whether textual or image representations of user behavior data are more effective for maximizing MLLM performance remains underexplored. We present BehaviorLens, a systematic benchmarking framework for assessing modality trade-offs in user-behavior reasoning across six MLLMs by representing transaction data as (1) a text paragraph, (2) a scatter plot, and (3) a flowchart. Using a real-world purchase-sequence dataset, we find that when data is represented as images, MLLMs' next-purchase prediction accuracy improves by 87.5% compared with an equivalent textual representation, without any additional computational cost.
Problem

Research questions and friction points this paper is trying to address.

Evaluating textual versus visual representations of user behavior data
Determining optimal modality for multimodal LLM reasoning performance
Assessing image-based data representations for purchase prediction accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

BehaviorLens framework benchmarks modality trade-offs
Represents transaction data as text and images
Image representation improves prediction accuracy by 87.5%
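The paper does not appear to release code, but the core idea of the comparison can be sketched: the same purchase sequence is encoded once as a text paragraph for a text-only prompt, and once as scatter-plot coordinates (timestep vs. category) that would be rendered to an image for the visual prompt. The sample purchases, function names, and category encoding below are hypothetical illustrations, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): two of the three BehaviorLens
# representations of a purchase sequence. Sample data is hypothetical.

purchases = [
    ("2024-01-03", "coffee"),
    ("2024-01-10", "milk"),
    ("2024-01-11", "coffee"),
    ("2024-01-20", "cereal"),
]

def as_text_paragraph(seq):
    """Representation (1): a natural-language paragraph for a text-only prompt."""
    steps = [f"on {date} the user bought {item}" for date, item in seq]
    return "The user's purchase history: " + "; ".join(steps) + "."

def as_scatter_points(seq):
    """Representation (2): (timestep, category-index) points for a scatter plot.

    Rendering is omitted to keep the sketch dependency-free; with matplotlib
    one would call plt.scatter(xs, ys), label the y-axis with the category
    names, and pass the saved figure to the MLLM as the image input.
    """
    categories = sorted({item for _, item in seq})
    index = {c: i for i, c in enumerate(categories)}
    xs = list(range(len(seq)))
    ys = [index[item] for _, item in seq]
    return xs, ys, categories

print(as_text_paragraph(purchases))
print(as_scatter_points(purchases))
```

Both encodings carry the same information; the paper's finding is that the image form of it is substantially easier for current MLLMs to reason over.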
Tianning Dong, Personalization Team, Walmart Global Tech, Sunnyvale, California, USA
Luyi Ma, Walmart (Recommender System, Representation Learning, Seasonality, User modeling)
Varun Vasudevan, Personalization Team, Walmart Global Tech, Sunnyvale, California, USA
Jason H. D. Cho, Personalization Team, Walmart Global Tech, Sunnyvale, California, USA
Sushant Kumar, Personalization Team, Walmart Global Tech, Sunnyvale, California, USA
Kannan Achan, Walmartlabs (machine learning, artificial intelligence, generative modeling)