DeepShop: A Benchmark for Deep Research Shopping Agents

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing shopping-agent benchmarks oversimplify real-world complexity, failing to capture multi-dimensional product attributes, search filters, and personalized sorting preferences. To address this, we propose DeepShop, a benchmark tailored to complex, realistic shopping scenarios. Methodologically, (1) starting from real user queries, we construct a multi-domain dataset via query diversity evolution; (2) we further evolve queries for complexity along product attributes, search filters, and sorting preferences, classifying them as easy, medium, or hard based on the number of evolutions; (3) we develop an automated web-interaction evaluation framework that scores fine-grained aspects (attributes, filters, sorting) and overall success, supporting cross-method assessment of RAG systems, web agents, and deep research architectures. Empirical results reveal significant weaknesses in current methods: filtering and sorting performance is notably poor, and RAG fails on over 80% of complex queries. DeepShop establishes a rigorous new baseline for shopping-agent evaluation.

📝 Abstract
Web agents for online shopping have shown great promise in automating user interactions across e-commerce platforms. Benchmarks for assessing such agents do not reflect the complexity of real-world shopping scenarios, as they often consist of overly simple queries with deterministic paths, such as "Find iPhone 15." Real shopping scenarios are inherently more layered, involving multi-dimensional product attributes, search filters, and user-specific sorting preferences. To address this gap, we introduce DeepShop, a benchmark designed to evaluate web agents in complex and realistic online shopping environments. DeepShop comprises three key components. (1) Query diversity evolution: Starting from real user queries, we generate diverse queries across five popular online shopping domains. (2) Query complexity evolution: We further evolve these queries to increase complexity, considering product attributes, search filters, and sorting preferences, and classify them into three levels: easy, medium, and hard, based on the number of evolutions. (3) Fine-grained and holistic evaluation: We propose an automated evaluation framework that assesses agent performance in terms of fine-grained aspects (product attributes, search filters, and sorting preferences) and reports the overall success rate through holistic evaluation. We conduct a systematic evaluation of retrieval-augmented generation (RAG) methods, web agents, and deep research systems. Results show that RAG struggles with complex queries due to its lack of web interaction, while other methods face significant challenges with filters and sorting preferences, leading to low overall success rates. We also perform cross-category, complexity-based evaluations and error analyses to support the advancement of deep research shopping agents.
Problem

Research questions and friction points this paper is trying to address.

Lack of benchmarks reflecting real-world shopping complexity
Need for evaluating agents in multi-dimensional shopping scenarios
Challenges in handling filters and sorting preferences effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates diverse queries from real user inputs
Evolves queries for complexity with attributes and filters
Automated evaluation framework for fine-grained performance
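The evaluation scheme described above scores each query on fine-grained aspects (product attributes, search filters, sorting preferences) and reports an overall success rate holistically. A minimal sketch of that scoring logic, assuming a simplified data shape in which agent output and gold labels are plain dicts keyed by aspect (the function names and aspect keys here are illustrative, not the paper's actual API):

```python
# Hypothetical sketch of fine-grained + holistic scoring for one shopping query.
# Aspect names and the exact-match criterion are simplifying assumptions.

ASPECTS = ("attributes", "filters", "sorting")

def evaluate_query(result: dict, gold: dict) -> dict:
    """Score a single query: 1.0 per aspect if it matches the gold label."""
    scores = {a: float(result.get(a) == gold.get(a)) for a in ASPECTS}
    # Holistic success requires every fine-grained aspect to be satisfied.
    scores["success"] = float(all(scores[a] == 1.0 for a in ASPECTS))
    return scores

def success_rate(results: list, golds: list) -> float:
    """Overall success rate across an evaluated query set."""
    runs = [evaluate_query(r, g) for r, g in zip(results, golds)]
    return sum(run["success"] for run in runs) / len(runs)
```

This all-aspects-must-hold definition of success is one way such a framework can explain the paper's finding: an agent may satisfy product attributes yet still fail overall because filters or sorting were mishandled.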