🤖 AI Summary
Existing e-commerce web agent benchmarks suffer from two key limitations: narrow task coverage—restricted primarily to product search—and neglect of safety-critical risks such as erroneous user actions. To address these gaps, we propose Amazon-Bench, the first comprehensive evaluation benchmark designed for real-world e-commerce platforms (e.g., Amazon), encompassing diverse functionalities including product search, account management, and gift card operations. We introduce a functionality-oriented query generation pipeline that leverages webpage DOM structures and interactive elements (e.g., buttons, checkboxes) to automatically synthesize diverse, realistic user instructions. Furthermore, we develop an automated evaluation framework jointly optimizing for functional correctness and safety, uniquely quantifying hazardous behaviors—including accidental purchases, unintended address deletions, and misconfigured auto-reload settings. Experiments reveal that state-of-the-art web agents achieve low accuracy on complex tasks and exhibit substantial safety vulnerabilities, underscoring the urgent need for robust, reliable e-commerce agents.
📝 Abstract
Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., "Find an Apple Watch"), failing to capture the broader range of functionalities offered by real-world e-commerce platforms such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively affect the user's account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto-reload setting. To address these gaps, we propose a new benchmark called Amazon-Bench. To generate user queries that cover a broad range of tasks, we propose a data generation pipeline that leverages webpage content and interactive elements (e.g., buttons, checkboxes) to create diverse, functionality-grounded user queries covering tasks such as address management, wish list management, and brand store following. To improve agent evaluation, we propose an automated evaluation framework that assesses both the performance and the safety of web agents. We systematically evaluate different agents, finding that current agents struggle with complex queries and pose safety risks. These results highlight the need for developing more robust and reliable web agents.