🤖 AI Summary
This work addresses the limitations of general-purpose large language models (LLMs) and multimodal LLMs (MLLMs) in e-commerce, where performance is held back by insufficient domain knowledge, poor understanding of long-tail products, heterogeneous visual inputs, and the interplay among multiple stakeholder roles. To quantify this gap, the authors introduce OxyEcomBench, a unified multimodal e-commerce benchmark of approximately 6,300 instances that jointly covers platform operators, merchants, and consumers across six capability dimensions and 29 tasks. The benchmark supports text-only and text–image inputs in both single- and multi-turn dialogues, draws on real-world bilingual (Chinese–English) data verified by domain experts, applies a four-level difficulty rubric (P0–P3), and prioritizes cases in which the key evidence resides in images rather than text. This design enables end-to-end evaluation across multiple roles, tasks, and modalities. Evaluations on 20 mainstream LLMs and MLLMs show that even the leading models reach only modest performance and that the gaps between models narrow, suggesting that insufficient e-commerce-specific knowledge mutes the advantages of advanced general-purpose models in this domain.
📝 Abstract
LLMs and MLLMs have become indispensable tools across a wide range of applications. E-commerce, however, poses distinctive challenges (including intricate domain knowledge, long-tail product evidence, heterogeneous visual data, and the interplay among multiple stakeholder roles) that diverge substantially from the general world knowledge these models are primarily trained on, often causing a notable gap between their open-domain and e-commerce performance. To systematically quantify this gap, we introduce OxyEcomBench, a unified multimodal benchmark comprising approximately 6,300 high-quality instances for real-world bilingual Chinese–English e-commerce. Although several e-commerce benchmarks have been proposed, they typically adopt a single stakeholder perspective, target a narrow set of tasks, or address isolated challenges, making it difficult to holistically assess models' understanding of the full e-commerce pipeline. OxyEcomBench addresses these limitations by jointly covering platform operators, merchants, and customers across 6 capability aspects and 29 tasks, supporting text-only and mixed-modality inputs with single-image, multi-image, single-turn, and multi-turn configurations. All data is sourced from authentic e-commerce platforms and verified by domain experts. The benchmark further adopts a difficulty-aware design with a four-level P0–P3 rubric applied to all 29 tasks whose difficulty admits stable expert consensus, and rigorously prioritizes visually salient multimodal cases in which key evidence resides in images rather than text alone. Evaluations on 20 mainstream LLMs and MLLMs show that even the leading models attain modest performance and that performance gaps narrow on OxyEcomBench, suggesting that insufficient e-commerce-specific knowledge infusion mutes the advantages of advanced general-purpose models in this domain.