RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

📅 2026-04-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the lack of benchmarks for evaluating GUI agents in high-stakes, adversarial e-commerce risk-management scenarios; existing benchmarks primarily target general consumer applications. The authors introduce what they describe as the first high-fidelity interactive benchmark designed specifically for e-commerce risk control, comprising 1,513 real-world tasks across eight core domains. They formalize authentic risk-control workflows as Gymnasium-compatible interactive environments, decoupling policy planning from environment mechanics to support reinforcement learning training and scalable evaluation. Experiments show that state-of-the-art general-purpose models achieve only a 49.1% zero-shot success rate, while specialized open-weights GUI models fail almost entirely; fine-tuning open-source models with agentic reinforcement learning improves their performance by 16.2%.

📝 Abstract
Graphical User Interface (GUI) agents show strong capabilities for automating web tasks, but existing interactive benchmarks primarily target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as authentic e-commerce risk management remains underexplored. To bridge this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations, including uncooperative websites and partial environmental hijacking. To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics. Our evaluation across diverse models reveals a dramatic capability gap: top-tier generalist models achieve 49.1% success, while specialized open-weights GUI models suffer near-total failure. This highlights that foundation model scale currently matters more than zero-shot interface grounding in long-horizon professional tasks. We also demonstrate the viability of our infrastructure through agentic RL, which improves open-source models by 16.2%. These results position RiskWebWorld as a practical testbed for developing robust digital workers.
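To illustrate what decoupling policy planning from environment mechanics can look like, here is a minimal sketch that follows the Gymnasium interface conventions (`reset` returns an observation and an info dict; `step` returns observation, reward, terminated, truncated, info). The class `ToyRiskEnv`, the two policies, and the "click" and "wait" actions are hypothetical stand-ins invented for this example; they are not the RiskWebWorld API, and the block uses only the standard library rather than the `gymnasium` package itself.

```python
class ToyRiskEnv:
    """Hypothetical toy environment mirroring Gymnasium's reset/step conventions.

    The task succeeds once the agent has issued `target_clicks` "click"
    actions; the episode is truncated after `max_steps` steps.
    """

    def __init__(self, target_clicks=3, max_steps=10):
        self.target_clicks = target_clicks
        self.max_steps = max_steps
        self.clicks = 0
        self.steps = 0

    def reset(self, seed=None):
        # Gymnasium convention: return (observation, info).
        self.clicks = 0
        self.steps = 0
        return {"clicks": self.clicks}, {}

    def step(self, action):
        # Gymnasium convention: return
        # (observation, reward, terminated, truncated, info).
        self.steps += 1
        if action == "click":
            self.clicks += 1
        terminated = self.clicks >= self.target_clicks
        truncated = (not terminated) and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return {"clicks": self.clicks}, reward, terminated, truncated, {}


def always_click(obs):
    """Hypothetical policy: always issue the "click" action."""
    return "click"


def never_click(obs):
    """Hypothetical policy: never makes progress on the task."""
    return "wait"


def run_episode(env, policy):
    """Roll out one episode; return True if the task was completed."""
    obs, info = env.reset()
    while True:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        if terminated or truncated:
            return terminated


def success_rate(env, policy, episodes):
    """Fraction of episodes in which the policy completed the task."""
    wins = sum(run_episode(env, policy) for _ in range(episodes))
    return wins / episodes
```

Because the rollout loop only touches `reset` and `step`, the same `success_rate` function can score any policy, and an RL trainer could consume the same interface without knowing how the environment is implemented.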
Problem

Research questions and friction points this paper is trying to address.

GUI agents
e-commerce risk management
interactive benchmark
realistic evaluation
risk operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI agents
e-commerce risk management
interactive benchmark
reinforcement learning
Gymnasium-compliant infrastructure