EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-Commerce Models

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
In e-commerce multimodal understanding, product images do not universally enhance model performance and can even introduce redundancy. Existing datasets are limited in scale and task diversity, hindering systematic evaluation of image utility. To address this, we introduce EcomMMMU, a large-scale benchmark of 406K samples and roughly 9M images covering eight core e-commerce tasks, together with a dedicated VSS subset for probing how effectively models use visual content. Analysis on EcomMMMU reveals substantial heterogeneity in image utility across tasks and instances: images help in some cases and degrade performance in others. Building on this insight, we propose SUMEI, a data-driven method that predicts the visual utility of each image and selectively integrates images for downstream tasks. Extensive experiments show that SUMEI outperforms baselines that naively fuse all images, in both effectiveness and robustness, pointing toward utility-aware visual integration as a general strategy for multimodal large language models.

📝 Abstract
E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. However, this raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? Existing datasets are limited in both scale and design, making it difficult to systematically examine this question. To this end, we introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images. EcomMMMU comprises multi-image vision-language data designed around 8 essential tasks, along with a specialized VSS subset to benchmark the capability of multimodal large language models (MLLMs) to effectively utilize visual content. Analysis on EcomMMMU reveals that product images do not consistently improve performance and can, in some cases, degrade it. This indicates that MLLMs may struggle to effectively leverage rich visual content for e-commerce tasks. Building on these insights, we propose SUMEI, a data-driven method that strategically utilizes multiple images by predicting their visual utilities before using them for downstream tasks. Comprehensive experiments demonstrate the effectiveness and robustness of SUMEI. The data and code are available at https://anonymous.4open.science/r/submission25.
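The abstract describes SUMEI as predicting per-image visual utilities before the images are used downstream. The predictor's architecture is not given on this page, so the sketch below is a minimal, hypothetical version: a small MLP scoring each product image against the product text, assuming precomputed embeddings. All names, dimensions, and the training signal are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisualUtilityPredictor(nn.Module):
    """Hypothetical sketch: scores each product image's expected usefulness
    for a downstream task, given precomputed image and text embeddings."""

    def __init__(self, img_dim: int = 768, txt_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one scalar utility logit per image
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_emb: (num_images, img_dim); txt_emb: (txt_dim,) for the product text.
        txt = txt_emb.unsqueeze(0).expand(img_emb.size(0), -1)
        return self.mlp(torch.cat([img_emb, txt], dim=-1)).squeeze(-1)
```

A predictor of this kind would need labels indicating whether including each image helped or hurt a downstream task, which is the sort of per-image utility signal the paper's analysis measures.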
Problem

Research questions and friction points this paper is trying to address.

Evaluating whether product images enhance or degrade multimodal model performance
Addressing the limited scale and task design of existing e-commerce multimodal datasets
Developing methods that use multiple product images selectively rather than indiscriminately
Innovation

Methods, ideas, or system contributions that make the work stand out.

EcomMMMU dataset with 406K multimodal samples and roughly 9M images across 8 tasks
SUMEI predicts per-image visual utility before downstream use (see the sketch below)
Data-driven selection of which images to feed to the model
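To make the selection step concrete, here is a hypothetical rule for turning predicted utilities into an image subset. The page does not state SUMEI's exact policy; the threshold-plus-cap rule, function name, and defaults below are assumptions for illustration.

```python
from typing import List, Sequence

def select_images(images: Sequence[str], utilities: Sequence[float],
                  threshold: float = 0.0, max_images: int = 4) -> List[str]:
    """Keep the highest-utility images that clear the threshold.

    Hypothetical policy: the paper reports strategic multi-image use, but the
    concrete rule (threshold vs. top-k, and both defaults here) is assumed.
    """
    ranked = sorted(zip(images, utilities), key=lambda pair: pair[1], reverse=True)
    kept = [img for img, score in ranked if score > threshold]
    # An empty result means falling back to text-only input, consistent with
    # the finding that images can sometimes degrade performance.
    return kept[:max_images]
```

For example, select_images(["front.jpg", "back.jpg", "box.jpg"], [0.9, -0.2, 0.4]) returns ["front.jpg", "box.jpg"]; when no image clears the threshold, the input degrades gracefully to text only.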