🤖 AI Summary
Existing evaluation methods for e-commerce agents struggle to balance realism and controllability: live storefronts are hard to reproduce, while hand-built sandboxes lack diversity and realistic structure. To address this limitation, this work proposes ShopGym, a framework with two components: ShopArena, which converts live e-commerce storefronts into resettable, inspectable sandbox shops, and ShopGuru, which generates benchmark tasks spanning seven agent skill categories. ShopGym is designed to provide realism, diversity, controllability, inspectability, and reproducibility at the same time. In experiments with 224 generated tasks across six sandbox shops, the synthetic shops preserve key structural properties of live storefronts, and agent performance on the synthetic shops is positively correlated with performance on live storefronts.
📝 Abstract
Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for constructing realistic e-commerce simulation environments and grounded benchmark tasks, enabling scalable benchmarking of e-commerce web agents. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts and that agent performance on synthetic shops is positively correlated with performance on live storefronts.
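As a rough illustration of the final claim, the sketch below shows one way a correlation between sandbox and live performance could be computed. It assumes a Pearson coefficient over per-agent success rates; the abstract does not say which correlation measure or aggregation the authors use, and the function name `pearson` and the numbers are hypothetical placeholders, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-agent task success rates (placeholders, NOT paper data):
# one entry per agent, measured on sandbox shops and on live storefronts.
sandbox_success = [0.2, 0.4, 0.6, 0.8]
live_success = [0.1, 0.3, 0.5, 0.9]

r = pearson(sandbox_success, live_success)
print(f"Pearson r = {r:.3f}")
```

A coefficient above zero means agents that do better in the sandbox also tend to do better on live storefronts. A rank-based measure such as Spearman's would be a reasonable alternative when only the ordering of agents matters.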