ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluation benchmarks for e-commerce suffer from insufficient coverage of task heterogeneity, ambiguity between general and domain-specific capabilities, and difficulties in verifying factual accuracy. This paper introduces ChineseEcomQA, a Chinese e-commerce conceptual QA benchmark built on a three-dimensional design principle: “foundational concepts + e-commerce generality + e-commerce specificity.” The authors propose an integrated, scalable data construction paradigm combining LLM-based automated verification, RAG-enabled consistency checking, and multi-round expert annotation. The benchmark comprises 2,100+ high-quality QA pairs, systematically exposing factual errors, weak generalization, and domain transfer bottlenecks of mainstream Chinese LLMs in e-commerce concept understanding. Experiments demonstrate that ChineseEcomQA significantly improves evaluation reproducibility, attribution fidelity, and domain adaptability, establishing the first standardized, concept-level evaluation tool for e-commerce AI model development and deployment.

📝 Abstract
With the increasing use of Large Language Models (LLMs) in fields such as e-commerce, domain-specific concept evaluation benchmarks are crucial for assessing their domain capabilities. Existing LLMs may generate factually incorrect information in complex e-commerce applications, so an e-commerce concept benchmark is needed. Existing benchmarks face two primary challenges: (1) handling the heterogeneous and diverse nature of tasks, and (2) distinguishing between generality and specificity within the e-commerce field. To address these problems, we propose ChineseEcomQA, a scalable question-answering benchmark focused on fundamental e-commerce concepts. ChineseEcomQA is built on three core characteristics: Focus on Fundamental Concept, E-commerce Generality, and E-commerce Expertise. Fundamental concepts are designed to be applicable across a diverse array of e-commerce tasks, thus addressing the challenge of heterogeneity and diversity. Additionally, by carefully balancing generality and specificity, ChineseEcomQA effectively differentiates between broad e-commerce concepts, allowing for precise validation of domain capabilities. We achieve this through a scalable benchmark construction process that combines LLM validation, Retrieval-Augmented Generation (RAG) validation, and rigorous manual annotation. Based on ChineseEcomQA, we conduct extensive evaluations on mainstream LLMs and provide some valuable insights. We hope that ChineseEcomQA can guide future domain-specific evaluations and facilitate broader LLM adoption in e-commerce applications.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs in e-commerce domain
Address heterogeneous e-commerce tasks
Balance e-commerce generality and specificity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable e-commerce QA benchmark
LLM and RAG validation
Balanced generality and specificity
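
The abstract describes a construction process that chains LLM validation, RAG validation, and manual annotation. A minimal sketch of such a staged filter is shown below. It is not the authors' implementation: `QAPair`, `filter_candidates`, and the three stand-in checks are hypothetical names, and each check is a trivial placeholder for what would in practice be a model call, a retrieval lookup, or a human review step.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str


# A validator returns True when a QA pair passes that stage.
Validator = Callable[[QAPair], bool]


def filter_candidates(candidates: Iterable[QAPair],
                      stages: List[Validator]) -> List[QAPair]:
    """Keep only the pairs accepted by every stage, applied in order."""
    return [p for p in candidates if all(stage(p) for stage in stages)]


def llm_check(pair: QAPair) -> bool:
    # Placeholder (assumption): a real stage would ask an LLM to judge the pair.
    return bool(pair.question.strip()) and bool(pair.answer.strip())


def make_rag_check(reference_docs: List[str]) -> Validator:
    # Placeholder (assumption): a real stage would retrieve documents and
    # compare the answer against them; here we only test substring support.
    def rag_check(pair: QAPair) -> bool:
        return any(pair.answer in doc for doc in reference_docs)
    return rag_check


def make_manual_check(approved_questions: Set[str]) -> Validator:
    # Placeholder (assumption): stands in for annotator sign-off.
    def manual_check(pair: QAPair) -> bool:
        return pair.question in approved_questions
    return manual_check


docs = ["SKU stands for stock keeping unit."]
candidates = [
    QAPair("What does SKU stand for?", "stock keeping unit"),
    QAPair("What does SKU stand for?", "standard kit unit"),  # unsupported answer
    QAPair("", "stock keeping unit"),                         # empty question
]
stages = [llm_check, make_rag_check(docs),
          make_manual_check({"What does SKU stand for?"})]
kept = filter_candidates(candidates, stages)
print(kept)
```

Ordering the stages from cheapest to most expensive means a pair rejected early never reaches the later checks, since `all` stops at the first failing stage.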
Haibin Chen
Taobao & Tmall Group of Alibaba, Hangzhou, China
Kangtao Lv
Taobao & Tmall Group of Alibaba, Hangzhou, China
Chengwei Hu
Fudan University
NLPKG
Yanshi Li
Taobao & Tmall Group of Alibaba, Beijing, China
Yujin Yuan
Taobao & Tmall Group of Alibaba, Hangzhou, China
Yancheng He
Alibaba Group
LLM
Xingyao Zhang
Microsoft
Langming Liu
PhD, City University of Hong Kong
Recommendation, Large Language Models, Federated Learning
Shilei Liu
Taobao & Tmall Group of Alibaba, Hangzhou, China
Wenbo Su
Taobao & Tmall Group of Alibaba, Beijing, China
Bo Zheng
Taobao & Tmall Group of Alibaba, Beijing, China