AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the highly non-uniform distribution of visual information in GUI screenshots, which makes it hard for conventional compression methods to remove redundancy while preserving critical details. The authors propose a training-free, inference-time visual token reduction approach that applies adaptive quadtree partitioning to GUI agents. By combining spatially adaptive subdivision, leaf-node token merging, and consistent positional encoding, the method exploits the spatial redundancy inherent in GUI layouts. A conditional quadtree that uses the quadtrees of previous steps as references further improves temporal consistency across interaction sequences. On GUI-Owl-1.5-32B-Instruct, the approach uses up to 29.52% fewer visual tokens and achieves up to 13.22% inference speedup while retaining 99.06% of full-token performance.

📝 Abstract

Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are added to the prompt at each iteration step. However, these screenshots exhibit highly non-uniform spatial information density: large regions may be visually homogeneous and carry little information, while key text and icons may require high visual fidelity. Existing approaches to this problem either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. To fill this gap, this paper proposes AQuaUI, a training-free inference-time token reduction method for GUI agent models that exploits the non-uniform information density of screenshots. AQuaUI constructs an adaptive quadtree on each input screenshot and keeps one representative merged token per leaf of the quadtree. AQuaUI preserves the spatial positions of retained tokens throughout the pipeline so that all position-encoding stages remain consistent. To further improve temporal consistency across multi-step GUI interactions, we propose a conditional quadtree algorithm that leverages the continuity between consecutive screenshots within a single request. Specifically, it refines the current quadtree using previous quadtrees as references, helping preserve fine-grained regions across static or mildly shifted GUI states. We implement AQuaUI on state-of-the-art GUI agent models and conduct experiments on standard grounding and navigation benchmarks. AQuaUI consistently shows improved accuracy-efficiency trade-offs over prior baselines. Notably, on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance, suggesting that the spatial redundancy of GUI screenshots can be exploited at inference without retraining.
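
The quadtree idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a square, power-of-two grid of scalar patch features (real visual tokens are embedding vectors), a variance threshold as the split criterion, mean pooling as the merge rule, and the leaf's top-left grid cell as the retained position. The names `build_quadtree`, `merge_leaves`, and `prev_leaves` are hypothetical.

```python
from statistics import pvariance


def build_quadtree(grid, r0, c0, size, threshold, prev_leaves=()):
    """Split a square region of a patch grid until it looks homogeneous.

    Returns leaves as (row, col, size) tuples. `prev_leaves` is an optional
    reference tree from the previous screenshot (an assumed form of the
    conditional refinement): a region is also split whenever the previous
    tree held a finer leaf inside it.
    """
    cells = [grid[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)]
    finer_before = any(
        ps < size and r0 <= pr < r0 + size and c0 <= pc < c0 + size
        for pr, pc, ps in prev_leaves
    )
    # Stop at single patches, or at homogeneous regions with no finer reference.
    if size == 1 or (pvariance(cells) <= threshold and not finer_before):
        return [(r0, c0, size)]
    half = size // 2
    leaves = []
    for dr in (0, half):
        for dc in (0, half):
            leaves += build_quadtree(grid, r0 + dr, c0 + dc, half, threshold, prev_leaves)
    return leaves


def merge_leaves(grid, leaves):
    """Keep one averaged token per leaf, tagged with the leaf's top-left position."""
    merged = []
    for r0, c0, size in leaves:
        cells = [grid[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)]
        merged.append(((r0, c0), sum(cells) / len(cells)))
    return merged


# Toy 4x4 "screenshot": a flat background with one high-detail patch.
frame1 = [
    [0.0, 9.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
leaves1 = build_quadtree(frame1, 0, 0, 4, threshold=0.5)
tokens1 = merge_leaves(frame1, leaves1)  # 7 merged tokens instead of 16

# Next step: the detail has faded below the threshold.
frame2 = [row[:] for row in frame1]
frame2[0][1] = 1.0
plain = build_quadtree(frame2, 0, 0, 4, threshold=0.5)  # collapses to one leaf
conditional = build_quadtree(frame2, 0, 0, 4, threshold=0.5, prev_leaves=leaves1)
```

In this toy case the unconditioned tree on `frame2` merges the whole grid into one token, while the tree conditioned on `leaves1` keeps the previously fine-grained quadrant at full resolution. Each merged token carries its original grid position, which is one simple way to keep positional information consistent after merging.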

Problem

Research questions and friction points this paper is trying to address.

GUI agents

visual token reduction

spatial redundancy

information density

multimodal models

Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive quadtree

token reduction

GUI agents