🤖 AI Summary
This study investigates whether visual or textual modalities are more effective in multimodal large language models (MLLMs) for user-behavior reasoning. Addressing the lack of systematic modality comparison in prior work, it introduces BehaviorLens, a framework that conducts the first comprehensive evaluation of six MLLMs on real-world purchase-sequence data, assessing their modeling capabilities across three behavioral representations: textual paragraphs, scatter plots, and flowcharts. Results show that visual representations, particularly scatter plots, substantially outperform textual descriptions, improving next-purchase prediction accuracy by 87.5% without increasing computational overhead. This finding highlights the advantage of visual encoding for behavioral sequence modeling and provides empirical grounding for modality selection in MLLMs aimed at user-behavior understanding.
📝 Abstract
Multimodal Large Language Models (MLLMs) are reshaping how modern agentic systems reason over sequential user-behavior data. However, whether textual or image representations of user-behavior data are more effective for maximizing MLLM performance remains underexplored. We present BehaviorLens, a systematic benchmarking framework for assessing modality trade-offs in user-behavior reasoning across six MLLMs, representing transaction data as (1) a text paragraph, (2) a scatter plot, and (3) a flowchart. On a real-world purchase-sequence dataset, we find that when the data is represented as images, MLLMs' next-purchase prediction accuracy improves by 87.5% compared with an equivalent textual representation, without any additional computational cost.
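To make the modality comparison concrete, the sketch below shows one plausible way to render the same hypothetical purchase sequence as two of the representations the abstract names: a textual paragraph and a scatter-plot image. This is a minimal illustration, not the authors' actual pipeline; the item names, the day-indexed timestamps, and the plotting choices (category index on the y-axis, day on the x-axis) are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is required
import matplotlib.pyplot as plt

# Hypothetical purchase sequence: (day of purchase, item category)
purchases = [(0, "shoes"), (3, "socks"), (7, "jacket"), (12, "shoes")]

def to_text_paragraph(seq):
    """Representation (1): the sequence as a text paragraph for the prompt."""
    steps = [f"on day {t} the user bought {item}" for t, item in seq]
    return "Purchase history: " + "; ".join(steps) + "."

def to_scatter_plot(seq, path="purchases.png"):
    """Representation (2): the sequence as a scatter-plot image.

    Each point is one purchase: x = day, y = index of the item category.
    The saved PNG would be passed to the MLLM as an image input.
    """
    categories = sorted({item for _, item in seq})
    x = [t for t, _ in seq]
    y = [categories.index(item) for _, item in seq]
    fig, ax = plt.subplots()
    ax.scatter(x, y)
    ax.set_xlabel("day")
    ax.set_yticks(range(len(categories)))
    ax.set_yticklabels(categories)
    fig.savefig(path)
    plt.close(fig)
    return path

print(to_text_paragraph(purchases))
to_scatter_plot(purchases)
```

Both functions encode the same underlying events, which is the point of the benchmark: the information content is held fixed while only the modality handed to the MLLM changes.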