Towards Reliable Evaluation of Large Language Models for Multilingual and Multimodal E-Commerce Applications

📅 2025-10-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing e-commerce evaluation benchmarks suffer from task narrowness (e.g., lacking product guidance and after-sales support), modality limitations (predominantly text-only), synthetic data generation, and restricted language coverage (primarily English and Chinese), thus failing to reflect real-world shopping scenarios. To address these gaps, we propose EcomEval, the first multilingual, multimodal large language model (LLM) benchmark for e-commerce, constructed from authentic user queries and transaction logs and spanning six categories with 37 fine-grained tasks. It introduces a novel fine-grained difficulty taxonomy, supports five low-resource Southeast Asian languages, and employs a semi-automated annotation pipeline combining LLM generation with rigorous manual review by 50+ domain experts to ensure reference answer quality. EcomEval enables stratified capability assessment across model scales and challenge-oriented analysis. By unifying task diversity, linguistic breadth, and data authenticity, it establishes a scalable, high-fidelity evaluation infrastructure for multilingual, multimodal e-commerce AI.

πŸ“ Abstract
Large Language Models (LLMs) excel on general-purpose NLP benchmarks, yet their capabilities in specialized domains remain underexplored. In e-commerce, existing evaluations such as EcomInstruct, ChineseEcomQA, eCeLLM, and Shopping MMLU suffer from limited task diversity (e.g., lacking product guidance and after-sales issues), limited modality coverage (e.g., no multimodal data), synthetic or curated data, and a narrow focus on English and Chinese, leaving practitioners without reliable tools to assess models on complex, real-world shopping scenarios. We introduce EcomEval, a comprehensive multilingual and multimodal benchmark for evaluating LLMs in e-commerce. EcomEval covers six categories and 37 tasks (including 8 multimodal tasks), sourced primarily from authentic customer queries and transaction logs, reflecting the noisy and heterogeneous nature of real business interactions. To ensure both the quality and scalability of reference answers, we adopt a semi-automatic pipeline in which large models draft candidate responses that are subsequently reviewed and revised by over 50 expert annotators with strong e-commerce and multilingual expertise. We define difficulty levels for each question and task category by averaging evaluation scores across models of different sizes and capabilities, enabling challenge-oriented and fine-grained assessment. EcomEval also spans seven languages, including five low-resource Southeast Asian languages, offering a multilingual perspective absent from prior work.
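The semi-automatic annotation pipeline described above (an LLM drafts a reference answer, then a human expert reviews and edits it) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation; `draft_fn` and `review_fn` are placeholder callables standing in for the drafting model and the expert-review step.

```python
# Hedged sketch of a draft-then-review annotation loop.
# draft_fn and review_fn are hypothetical stand-ins for the LLM
# drafter and the human expert reviewer, respectively.
from dataclasses import dataclass
from typing import Callable


@dataclass
class AnnotatedExample:
    query: str          # the authentic user query
    draft_answer: str   # candidate answer drafted by an LLM
    final_answer: str   # answer after expert review/revision
    reviewer: str       # annotator identifier


def annotate(query: str,
             draft_fn: Callable[[str], str],
             review_fn: Callable[[str, str], str],
             reviewer: str) -> AnnotatedExample:
    """Draft a candidate answer, then pass it through expert review."""
    draft = draft_fn(query)
    final = review_fn(query, draft)
    return AnnotatedExample(query, draft, final, reviewer)


# Toy usage with stub functions in place of a real model and reviewer.
example = annotate(
    "Where is my parcel?",
    draft_fn=lambda q: "Your parcel is in transit.",
    review_fn=lambda q, d: d + " Check the tracking page for updates.",
    reviewer="expert-01",
)
print(example.final_answer)
```

The key design point the paper emphasizes is that the human pass is mandatory: the draft is only a starting point, which keeps quality high while scaling annotation throughput.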
Problem

Research questions and friction points this paper is trying to address.

Existing e-commerce benchmarks cover too few tasks, leaving LLM capabilities in this specialized domain underexplored.
Prior evaluations lack multimodal data and focus narrowly on English and Chinese.
Practitioners lack reliable tools for assessing models on complex, real-world shopping scenarios.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-automatic pipeline generates expert-reviewed reference answers
Multimodal tasks sourced from authentic customer queries
Difficulty levels defined by cross-model evaluation scores
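The last point, difficulty levels derived from cross-model evaluation scores, can be sketched as below. This is an illustrative assumption of how such averaging might work: the bucket thresholds and model names are hypothetical, not taken from the paper; the core idea is simply that a lower mean score across a panel of models of different sizes indicates a harder question.

```python
# Hypothetical sketch of difficulty scoring by averaging evaluation
# scores across models of different sizes. Thresholds (0.8, 0.5) and
# model names are illustrative assumptions, not the paper's values.
from statistics import mean


def difficulty_level(scores_by_model: dict[str, float]) -> str:
    """Map per-model scores in [0, 1] for one question to a label.

    A lower average score across the model panel means the question
    is harder for current models.
    """
    avg = mean(scores_by_model.values())
    if avg >= 0.8:
        return "easy"
    if avg >= 0.5:
        return "medium"
    return "hard"


# Toy panel: a small, mid-sized, and large model score one question.
panel = {"small-7b": 0.35, "mid-14b": 0.55, "large-70b": 0.75}
print(difficulty_level(panel))  # mean is 0.55 -> prints "medium"
```

Averaging over models of varying capability, rather than using a single judge, is what makes the resulting levels "challenge-oriented": a question only counts as easy if models across the scale spectrum handle it well.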
Authors
Shuyi Xie (Shopee)
Ziqin Liew (Shopee)
Hailing Zhang (Shopee)
Haibo Zhang (Shopee)
Ling Hu (Shopee)
Zhiqiang Zhou (Beijing Institute of Technology)
Shuman Liu (Shopee)
Anxiang Zeng (Shopee)