🤖 AI Summary
This study investigates whether visual or textual modalities are more effective in multimodal large language models (MLLMs) for user-behavior reasoning. Addressing the lack of systematic modality comparison in prior work, it introduces BehaviorLens, a framework that conducts the first comprehensive evaluation of six MLLMs on real-world purchase-sequence data, assessing their modeling capabilities across three behavioral representations: textual paragraphs, scatter plots, and flowcharts. Results show that visual representations, particularly scatter plots, substantially outperform textual descriptions, improving next-purchase prediction accuracy by 87.5% without increasing computational overhead. This finding highlights the advantage of visual encoding for behavioral sequence modeling and provides empirical grounding for modality selection in MLLMs aimed at user-behavior understanding.
📝 Abstract
Multimodal Large Language Models (MLLMs) are reshaping how modern agentic systems reason over sequential user-behavior data. However, whether textual or image representations of user-behavior data are more effective for maximizing MLLM performance remains underexplored. We present BehaviorLens, a systematic benchmarking framework for assessing modality trade-offs in user-behavior reasoning across six MLLMs, representing transaction data as (1) a text paragraph, (2) a scatter plot, and (3) a flowchart. On a real-world purchase-sequence dataset, we find that when the data is represented as images, MLLMs' next-purchase prediction accuracy improves by 87.5% compared with an equivalent textual representation, without any additional computational cost.
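To make the modality comparison concrete, the sketch below shows one plausible way to render the same hypothetical purchase sequence as two of the representations the abstract names: a textual paragraph and a scatter-plot image. This is a minimal illustration, not the authors' actual pipeline; the item names, the day-indexed timestamps, and the plotting choices (category index on the y-axis, day on the x-axis) are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is required
import matplotlib.pyplot as plt

# Hypothetical purchase sequence: (day of purchase, item category)
purchases = [(0, "shoes"), (3, "socks"), (7, "jacket"), (12, "shoes")]

def to_text_paragraph(seq):
    """Representation (1): the sequence as a text paragraph for the prompt."""
    steps = [f"on day {t} the user bought {item}" for t, item in seq]
    return "Purchase history: " + "; ".join(steps) + "."

def to_scatter_plot(seq, path="purchases.png"):
    """Representation (2): the sequence as a scatter-plot image.

    Each point is one purchase: x = day, y = index of the item category.
    The saved PNG would be passed to the MLLM as an image input.
    """
    categories = sorted({item for _, item in seq})
    x = [t for t, _ in seq]
    y = [categories.index(item) for _, item in seq]
    fig, ax = plt.subplots()
    ax.scatter(x, y)
    ax.set_xlabel("day")
    ax.set_yticks(range(len(categories)))
    ax.set_yticklabels(categories)
    fig.savefig(path)
    plt.close(fig)
    return path

print(to_text_paragraph(purchases))
to_scatter_plot(purchases)
```

Both functions encode the same underlying events, which is the point of the benchmark: the information content is held fixed while only the modality handed to the MLLM changes.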