VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

📅 2026-04-23

🤖 AI Summary
This work addresses two common failure modes of autonomous GUI agents, premature termination and repetitive looping, with a modular multi-agent framework built around a mandatory Completeness Verifier, a mandatory Loop Breaker, and an on-demand Search Agent. These components work alongside on-demand Coding and Grounding Agents so that the system can decide when to stop, recover, or search. The verifier rejects completion claims that lack direct visual evidence; the Loop Breaker escalates from switching interaction mode to forcing a strategy change when failures or screen states recur; and the Search Agent queries a search-capable LLM for unfamiliar workflows. Evaluated on OSWorld and WindowsAgentArena, the approach achieves success rates of 77.5% and 61.0%, respectively, and three of the five tested backbones surpass human performance (72.4%) on OSWorld in a single pass. The Loop Breaker nearly halves wasted steps for loop-prone models.

📝 Abstract
Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step, with an agent-level verifier that cross-examines completion claims against decision rules and rejects those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, Opus 4.6, and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
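
The multi-tier Loop Breaker described in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the class name `LoopBreaker`, the `Directive` values, the thresholds, and the use of a screen hash to detect state recurrence are all assumptions made here for clarity.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Directive(Enum):
    """Hypothetical instructions the loop breaker hands back to the agent."""
    CONTINUE = "continue"
    SWITCH_INTERACTION_MODE = "switch_interaction_mode"  # e.g. mouse -> keyboard
    FORCE_STRATEGY_CHANGE = "force_strategy_change"


@dataclass
class LoopBreaker:
    # Both thresholds are assumed values, not taken from the paper.
    max_action_failures: int = 2
    max_state_repeats: int = 3
    _failures: Counter = field(default_factory=Counter)
    _state_visits: Counter = field(default_factory=Counter)

    def observe(self, action: str, succeeded: bool, screen_hash: str,
                reflection_flags_stall: bool = False) -> Directive:
        """Record one agent step and return the directive for the next step."""
        self._state_visits[screen_hash] += 1
        if succeeded:
            self._failures[action] = 0
        else:
            self._failures[action] += 1

        # A reflection signal is bound directly to a strategy shift.
        if reflection_flags_stall:
            return Directive.FORCE_STRATEGY_CHANGE
        # Persistent recurrence of the same screen state forces a new strategy.
        if self._state_visits[screen_hash] >= self.max_state_repeats:
            return Directive.FORCE_STRATEGY_CHANGE
        # Repeated failures of the same action switch the interaction mode.
        if self._failures[action] >= self.max_action_failures:
            return Directive.SWITCH_INTERACTION_MODE
        return Directive.CONTINUE


if __name__ == "__main__":
    breaker = LoopBreaker()
    for step in range(2):
        directive = breaker.observe("click(save)", succeeded=False,
                                    screen_hash=f"screen-{step}")
        print(step, directive.value)
```

In this sketch the tiers are checked from most to least disruptive, so a strategy change takes precedence over a mere change of interaction mode when both conditions hold; the paper may order or combine the tiers differently.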
Problem

Research questions and friction points this paper is trying to address.

early stopping
repetitive loops
GUI automation
autonomous agents
task verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI Automation
Early Stopping Prevention
Loop Breaking
Modular Agentic Framework
LLM-based Search