WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

📅 2025-08-18
🤖 AI Summary
This work addresses the challenge of evaluating large language model (LLM)-driven web agents on cross-shop comparison-shopping. To this end, the authors introduce WebMall, a benchmark comprising four simulated online shops populated with authentic product offers sourced from the Common Crawl, together with 91 fine-grained tasks covering price comparison, checkout, vague-requirement search, substitute identification, and compatibility verification. Unlike prior e-commerce benchmarks, WebMall incorporates long-horizon tasks that span multiple heterogeneous shops, increasing both ecological validity and difficulty. Eight baseline agents built on state-of-the-art LLMs (GPT-4.1 and Claude Sonnet 4) are evaluated; the best configurations achieve peak task completion rates of 75% on basic tasks and 53% on advanced tasks, with corresponding F1 scores of 87% and 63%. The results expose bottlenecks in complex shopping reasoning and cross-shop coordination, filling a gap in web agent evaluation for e-commerce.

📝 Abstract
LLM-based web agents have the potential to automate long-running web tasks, such as finding offers for specific products in multiple online shops and subsequently ordering the cheapest products that meet the user's needs. This paper introduces WebMall, a multi-shop online shopping benchmark for evaluating the effectiveness and efficiency of web agents for comparison-shopping. WebMall consists of four simulated online shops populated with authentic product offers sourced from the Common Crawl, alongside a suite of 91 cross-shop tasks. These tasks include basic tasks such as finding specific products in multiple shops, performing price comparisons, adding items to the shopping cart, and completing checkout. Advanced tasks involve searching for products based on vague requirements, identifying suitable substitutes, and finding compatible products. Compared to existing e-commerce benchmarks, such as WebShop or ShoppingBench, WebMall introduces comparison-shopping tasks across multiple shops. Furthermore, the product offers are more heterogeneous, as they originate from hundreds of distinct real-world shops. The tasks in WebMall require longer interaction trajectories than those in WebShop, while remaining representative of real-world shopping behaviors. We evaluate eight baseline agents on WebMall, varying in observation modality, memory utilization, and underlying large language model (GPT-4.1 and Claude Sonnet 4). The best-performing configurations achieve completion rates of 75% and 53%, and F1 scores of 87% and 63%, on the basic and advanced task sets, respectively. WebMall is publicly released to facilitate research on web agents and to promote advancements in navigation, reasoning, and efficiency within e-commerce scenarios.
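The two headline metrics above can be made concrete with a small sketch. The completion rate is the fraction of tasks an agent finishes successfully; the F1 score, for tasks whose answer is a set of offers, compares the offers an agent returns against a gold set. Note this is a minimal illustration of set-based F1 scoring, not WebMall's exact evaluation protocol; the function names and the per-task averaging choice are assumptions.

```python
def offer_f1(predicted: set[str], gold: set[str]) -> float:
    """Hypothetical per-task F1 over returned vs. gold offer identifiers."""
    if not predicted or not gold:
        return 0.0
    hits = len(predicted & gold)          # correctly returned offers
    precision = hits / len(predicted)
    recall = hits / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def completion_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks marked as successfully completed."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Example: agent returns two offers, one of which matches the gold set.
score = offer_f1({"shop1/offer-42", "shop3/offer-7"},
                 {"shop1/offer-42", "shop2/offer-9"})
# precision = recall = 1/2, so F1 = 0.5
```

Averaging `offer_f1` over the basic or advanced task set would yield aggregate numbers comparable in spirit to the reported 87% and 63%.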
Problem

Research questions and friction points this paper is trying to address.

Evaluating web agents for multi-shop comparison-shopping tasks
Assessing agent performance on heterogeneous real-world product offers
Measuring navigation and reasoning in long e-commerce interaction trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-shop benchmark for web agents
Authentic product offers from Common Crawl
Evaluates navigation and reasoning in e-commerce