Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

📅 2026-05-28
🤖 AI Summary
Existing GUI agents are not robust enough for real-world use, largely because they often cannot recover from their own errors. To address this limitation, this work introduces GUI-RobustEval, the first systematic benchmark of executable test cases for evaluating GUI agent robustness across diverse error patterns. It also proposes RoTS, a robustness-driven, tree-based trajectory synthesis framework that actively identifies failure points and generates the corresponding recovery paths through an extensible tree structure, producing large-scale data for fine-tuning. Two models fine-tuned on this data, RoTS-7B and RoTS-32B, show substantially stronger error recovery, with significant gains on GUI-RobustEval and on traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a success rate of 47.4% and an All-Pass@4 score of 33.8%.
📝 Abstract
While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, which hinders real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). At the evaluation level, GUI-RobustEval contains 1,216 executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates 800k high-quality data samples via a tree-based pipeline that proactively discovers diverse error modes and synthesizes the corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, both fine-tuned on our dataset, demonstrate significant gains on GUI-RobustEval and on traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, suggesting that improved long-horizon error recovery contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.
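The abstract reports two numbers for OSWorld, a success rate and an All-Pass@4 score, without defining either. The sketch below shows one plausible reading and is not the authors' evaluation code. It assumes each task is attempted several times, that the success rate averages over all runs, and that All-Pass@k counts a task only when all of its first k runs succeed. The function names, the `results` layout, and the sample outcomes are hypothetical.

```python
from typing import Dict, List


def success_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of successful runs, pooled over every task and run."""
    runs = [ok for task_runs in results.values() for ok in task_runs]
    if not runs:
        raise ValueError("no runs recorded")
    return sum(runs) / len(runs)


def all_pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose first k runs ALL succeed (assumed definition)."""
    if not results:
        raise ValueError("no tasks recorded")
    for task, task_runs in results.items():
        if len(task_runs) < k:
            raise ValueError(f"task {task!r} has fewer than {k} runs")
    passed = sum(all(task_runs[:k]) for task_runs in results.values())
    return passed / len(results)


# Made-up outcomes for four tasks, four runs each (illustration only).
results = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [False, False, False, False],
    "task_d": [True, True, True, True],
}

print(success_rate(results))      # 11 of 16 runs succeed
print(all_pass_at_k(results, 4))  # 2 of 4 tasks pass every run
```

Under this reading, All-Pass@k can never exceed the pooled success rate, because a single failed run disqualifies the whole task. That matches the gap between the two reported figures (33.8% versus 47.4%) and explains why the stricter metric is a natural proxy for consistency.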
Problem

Research questions and friction points this paper is trying to address.

GUI agents
error recovery
robustness
policy-induced errors
trajectory synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

error recovery
trajectory synthesis
GUI agents
robustness benchmarking
tree-based data generation
Tianpeng Bu
Alibaba Cloud Computing
Xin Liu
Alibaba Cloud Computing
Qihua Chen
Alibaba Cloud Computing
Hao Jiang
Alibaba Group
LLM & AIGC
Shurui Li
Alibaba Cloud Computing
Hongtao Duan
Nanjing Institute of Geography and Limnology, Chinese Academy of Sciences
Ocean color remote sensing, Lake remote sensing
Lu Jiang
Research Scientist @ Apple
Generative AI, Foundation Model, Robust Deep Learning, Multimedia, Video Generation
Lulu Hu
Alibaba Cloud Computing
Bin Yang
Alibaba Cloud Computing
Minying Zhang
Alibaba Cloud Computing