🤖 AI Summary
Existing benchmarks are largely confined to academic or synthetic settings and fail to reflect the practical challenges agents face in real-world e-commerce applications. To close this gap, we propose EcomBench, the first comprehensive real-world e-commerce benchmark, constructed from authentic user intents on leading global e-commerce platforms and covering diverse task types across a three-tiered difficulty spectrum. Methodologically, we embed agent evaluation directly within actual e-commerce workflows; design a novel three-dimensional difficulty assessment framework that balances practicality, scalability, and expert validation; and combine real-user behavior mining, expert-coordinated annotation, and cross-platform abstraction to curate an open-source benchmark of over one thousand high-quality samples. EcomBench significantly improves the reproducibility and industry relevance of agent capability evaluation in e-commerce, and it has already been adopted by multiple industrial agent teams for iterative model validation.
📝 Abstract
Foundation agents have rapidly advanced in their ability to reason and interact with real environments, making the evaluation of their core capabilities increasingly important. While many benchmarks have been developed to assess agent performance, most concentrate on academic settings or artificially designed scenarios and overlook the challenges that arise in real applications. To address this issue, we focus on a highly practical real-world setting, the e-commerce domain, which involves a large volume of diverse user interactions, dynamic market conditions, and tasks directly tied to real decision-making processes. We therefore introduce EcomBench, a holistic E-commerce Benchmark designed to evaluate agent performance in realistic e-commerce environments. EcomBench is built from genuine user demands embedded in leading global e-commerce ecosystems and is carefully curated and annotated by human experts to ensure clarity, accuracy, and domain relevance. It covers multiple task categories within e-commerce scenarios and defines three difficulty levels that evaluate agents on key capabilities such as deep information retrieval, multi-step reasoning, and cross-source knowledge integration. By grounding evaluation in real e-commerce contexts, EcomBench provides a rigorous and dynamic testbed for measuring the practical capabilities of agents in modern e-commerce.
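To make the benchmark structure concrete, the sketch below shows what a single EcomBench-style sample might look like as a data record. This is purely illustrative: the abstract does not specify the actual schema, so every field name here (`question`, `task_category`, `capabilities`, `reference_answer`, and so on) is a hypothetical assumption, not the released format.

```python
from dataclasses import dataclass, field
from enum import Enum


class Difficulty(Enum):
    """The three-tiered difficulty spectrum described in the abstract."""
    EASY = 1
    MEDIUM = 2
    HARD = 3


@dataclass
class EcomBenchSample:
    # Hypothetical schema: field names are illustrative, not the official format.
    question: str                 # user intent expressed as a task prompt
    platform: str                 # source ecosystem, abstracted across platforms
    task_category: str            # e.g. product search, policy lookup
    difficulty: Difficulty        # one of the three difficulty tiers
    capabilities: list[str] = field(default_factory=list)  # skills the task probes
    reference_answer: str = ""    # expert-annotated gold answer


# Example record combining two of the capabilities named in the abstract.
sample = EcomBenchSample(
    question="Which of these two laptops has the longer warranty under each seller's stated policy?",
    platform="platform_A",
    task_category="comparison",
    difficulty=Difficulty.HARD,
    capabilities=["deep_information_retrieval", "cross_source_knowledge_integration"],
    reference_answer="Laptop B (24-month warranty vs. 12-month).",
)
print(sample.difficulty.name)  # → HARD
```

A record like this would let an evaluation harness filter samples by difficulty tier or by targeted capability when reporting per-category scores.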