🤖 AI Summary
This work addresses precision-sensitive GUI tasks, in which a pixel-level coordinate error can trigger cascading topological failures in a geometric construction. The authors formally define this task regime and introduce PAGE Bench, a benchmark of 4,906 problems and over 224K process-supervised, pixel-level GUI actions. They also propose PAGER, a topology-aware agent that separates dependency-structured planning from pixel-level execution. Pixel-grounded supervised fine-tuning establishes an executable action grammar, and precision-aligned reinforcement learning uses state-conditioned geometric feedback to mitigate the exposure bias that arises during rollouts. Experiments show that PAGER achieves 4.1× higher task success than the strongest evaluated general-purpose baseline and raises step success rate from below 9% for GUI-specialized agents to over 62%.
📝 Abstract
Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
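The contrast between region-tolerant and point-precise interaction, and the way one coordinate error spreads to dependent objects, can be illustrated with a small sketch. This is not the paper's code or evaluation protocol: the function names, the 2-pixel tolerance, the bounding box, and the midpoint-and-circle construction are all hypothetical choices made only for illustration.

```python
import math

def region_hit(click, box):
    """Region-tolerant check: any pixel inside the bounding box is valid."""
    left, top, right, bottom = box
    x, y = click
    return left <= x <= right and top <= y <= bottom

def point_hit(click, target, tol_px=2.0):
    """Point-precise check: the click must land within tol_px of the target.
    The 2-pixel tolerance is an assumption, not a value from the paper."""
    return math.dist(click, target) <= tol_px

def midpoint(p, q):
    """Midpoint of two points; stands in for a dependent geometric object."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

# Intended construction: points A and B, their midpoint M,
# and a circle centered at M that passes through A.
A = (100.0, 100.0)
B = (200.0, 100.0)

# The agent clicks 6 pixels to the right of where B should be.
B_clicked = (206.0, 100.0)

# Under a region-tolerant view (hypothetical 30x20 box around B),
# the click still counts as correct.
box_around_B = (190.0, 90.0, 220.0, 110.0)
ok_region = region_hit(B_clicked, box_around_B)

# Under a point-precise view, the same click is a miss.
ok_point = point_hit(B_clicked, B)

# The error propagates: the midpoint and the circle radius both shift.
M_ideal = midpoint(A, B)
M_actual = midpoint(A, B_clicked)
r_ideal = math.dist(M_ideal, A)
r_actual = math.dist(M_actual, A)

print("region-tolerant hit:", ok_region)
print("point-precise hit:", ok_point)
print("midpoint drift (px):", math.dist(M_ideal, M_actual))
print("radius drift (px):", r_actual - r_ideal)
```

In this toy case a click that a region-based metric would accept moves the midpoint and the circle radius by 3 pixels each, so every object built on top of them inherits the error. This is the kind of dependency-driven propagation the abstract describes.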