ShoppingComp: Are LLMs Really Ready for Your Shopping Cart?

📅 2025-11-28
🤖 AI Summary
This paper identifies critical capability gaps in large language models (LLMs) in real-world e-commerce settings, particularly in safety-critical tasks such as precise product retrieval, expert-level report generation, and hazard identification (e.g., recommending dangerous items or succumbing to promotional bias). Method: the authors introduce ShoppingComp, the first safety-oriented benchmark for e-commerce agents, featuring an expert-annotated dataset of 1,026 realistic scenarios across 120 tasks, and propose a multidimensional evaluation framework that jointly assesses product authenticity, result verifiability, and safety risk detection. Contribution/Results: ShoppingComp establishes the first safety-centric evaluation standard for e-commerce agents. Experimental results reveal alarmingly low performance across mainstream models: GPT-5 achieves only 11.22%, and Gemini 2.5 Flash just 3.92%, underscoring substantial risks in real-world deployment.

📝 Abstract
We present ShoppingComp, a challenging real-world benchmark for rigorously evaluating LLM-powered shopping agents on three core capabilities: precise product retrieval, expert-level report generation, and safety-critical decision making. Unlike prior e-commerce benchmarks, ShoppingComp introduces highly complex tasks under the principle of guaranteeing real products and ensuring easy verifiability, adding a novel evaluation dimension for identifying product safety hazards alongside recommendation accuracy and report quality. The benchmark comprises 120 tasks and 1,026 scenarios, curated by 35 experts to reflect authentic shopping needs. Results reveal stark limitations of current LLMs: even state-of-the-art models achieve low performance (e.g., 11.22% for GPT-5, 3.92% for Gemini-2.5-Flash). These findings highlight a substantial gap between research benchmarks and real-world deployment, where LLMs make critical errors, such as failing to identify unsafe product usage or falling for promotional misinformation, that lead to harmful recommendations. ShoppingComp fills this gap and establishes a new standard for advancing reliable and practical agents in e-commerce.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs on precise product retrieval and expert report generation
Assesses safety-critical decision making to identify product hazards
Highlights performance gap between benchmarks and real-world deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ShoppingComp benchmark for real-world LLM evaluation
Includes product safety hazard identification as novel dimension
Comprises 120 tasks and 1,026 scenarios curated by 35 experts for authenticity
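To make the multidimensional evaluation concrete, here is a minimal sketch of how per-scenario scores across the three dimensions (retrieval accuracy, report quality, safety hazard detection) might be aggregated into a single benchmark percentage. The field names, equal weighting, and the hard safety gate are illustrative assumptions, not the paper's actual metric.

```python
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    retrieval_correct: bool  # recommended product is a real, verifiable match
    report_quality: float    # expert-style report score in [0, 1] (assumed scale)
    hazard_flagged: bool     # agent identified the scenario's safety hazard

def score_scenario(r: ScenarioResult) -> float:
    """Score one scenario in [0, 1]; a missed hazard zeroes the score
    (hypothetical safety gate reflecting the benchmark's safety focus)."""
    if not r.hazard_flagged:
        return 0.0
    return 0.5 * float(r.retrieval_correct) + 0.5 * r.report_quality

def benchmark_score(results: list[ScenarioResult]) -> float:
    """Mean scenario score across all scenarios, as a percentage."""
    if not results:
        return 0.0
    return 100.0 * sum(score_scenario(r) for r in results) / len(results)
```

Under a gated aggregation like this, an agent that retrieves and reports well but misses hazards still scores near zero, which is one way a headline number as low as 3.92% can arise.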