🤖 AI Summary
Automated GUI agents face two critical bottlenecks: scarcity of annotated data for GUI grounding and absence of retrospective modeling of historical actions during planning. This paper introduces a unified agent framework for multi-platform GUI environments. To address data scarcity, we propose *scalable GUI annotation data synthesis*, the first method of its kind, generating high-fidelity, diverse training samples from multi-source templates. For planning, we design *bidirectional image-to-action modeling*—integrating forward action prediction with backward path retrospection—to explicitly capture GUI state evolution dynamics. Our approach unifies template-based sample generation, synthetic data augmentation, joint grounding-planning training, and vision-driven action generation with history-aware path reconstruction. Evaluated on cross-platform Web/Mobile benchmarks, our method achieves an average 12.6% higher task completion rate over state-of-the-art models, demonstrating substantial improvements in generalization and interpretability.
📝 Abstract
Automated GUI agents aim to facilitate user interaction by automatically performing complex tasks in digital environments such as web, mobile, and desktop devices. An agent receives a textual task instruction and a GUI description, and step by step generates executable actions (*e.g.*, click) together with operation boxes. Training a GUI agent mainly involves grounding and planning stages: GUI grounding focuses on finding the execution coordinates according to the task, while the planning stage predicts the next action based on historical actions. However, previous work suffers from two limitations: insufficient training data for GUI grounding, and neglect of backtracking historical behaviors in GUI planning. To handle these challenges, we propose ScaleTrack, a training framework that scales grounding data and backtracks planning for automated GUI agents. We carefully collect GUI samples under different synthesis criteria from a wide range of sources and unify them into the same template for training GUI grounding models. Moreover, we design a novel training strategy that predicts the next action from the current GUI image while also backtracking the historical actions that led to that image. In this way, ScaleTrack explains the correspondence between GUI images and actions, effectively describing the evolution rules of the GUI environment. Extensive experimental results demonstrate the effectiveness of ScaleTrack. Data and code will be available at url.
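The bidirectional training strategy described above can be illustrated with a minimal sketch of how training samples might be constructed from a recorded trajectory: each step yields a forward target (the next action, given the current screenshot) and a backward target (the action history that led to that screenshot). The function name and field keys here are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of bidirectional (forward + backtracking) sample
# construction, assuming a trajectory is a list of (screenshot, action)
# pairs in execution order. Not the paper's actual implementation.

def build_bidirectional_samples(trajectory):
    """For step t, emit the current screenshot s_t, the forward target
    a_t (next-action prediction), and the backward target a_1..a_{t-1}
    (historical actions to reconstruct)."""
    samples = []
    for t, (screenshot, action) in enumerate(trajectory):
        history = [a for _, a in trajectory[:t]]
        samples.append({
            "image": screenshot,          # current GUI state s_t
            "forward_target": action,     # next action to predict
            "backward_target": history,   # actions that led to s_t
        })
    return samples

traj = [("s0", "click(login)"), ("s1", "type(user)"), ("s2", "click(submit)")]
samples = build_bidirectional_samples(traj)
print(samples[2]["forward_target"])   # click(submit)
print(samples[2]["backward_target"])  # ['click(login)', 'type(user)']
```

In practice both targets would supervise the same vision-language backbone, so the model must encode not only what to do next but also how the current GUI state was reached.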