🤖 AI Summary
This work addresses the inefficiency and spatial hallucination issues in high-resolution GUI agents caused by substantial spatiotemporal redundancy in screenshots and historical trajectories. To this end, the authors propose GUIPruner, a training-free visual token compression framework that, for the first time, identifies temporal misalignment and spatial topological conflicts in existing methods. GUIPruner introduces Temporal-Adaptive Resolution (TAR) and Stratified Structure-aware Pruning (SSP) to jointly preserve interactive foregrounds, semantic anchors, and global layout. Evaluated on Qwen2-VL-2B, GUIPruner reduces FLOPs by 3.4× and accelerates visual encoding by 3.3× while retaining over 94% of the original performance, significantly outperforming current compression approaches.
📄 Abstract
Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
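To make the TAR idea concrete, here is a minimal sketch of decay-based resizing for historical screenshots. The paper's actual decay schedule and parameters are not specified above, so the exponential decay, the `decay` rate, and the `min_scale` floor below are all illustrative assumptions, not GUIPruner's implementation.

```python
# Hypothetical sketch of Temporal-Adaptive Resolution (TAR):
# older screenshots in the trajectory get progressively lower
# resolution, mirroring the agent's "fading memory" attention.
# The exponential schedule and parameter values are assumptions.

def tar_resolutions(base_size, num_steps, decay=0.7, min_scale=0.1):
    """Return a (width, height) target per historical step, newest first.

    base_size : (w, h) of the current full-resolution screenshot.
    num_steps : number of frames in the history window.
    decay     : per-step resolution scale factor (assumed, not from the paper).
    min_scale : floor so very old frames keep a minimal footprint.
    """
    w, h = base_size
    sizes = []
    for age in range(num_steps):
        s = max(decay ** age, min_scale)
        sizes.append((max(1, round(w * s)), max(1, round(h * s))))
    return sizes

# Example: a 1920x1080 screen with a 4-step history.
# The newest frame keeps full resolution; older frames shrink,
# so their visual-token counts drop roughly quadratically with age.
print(tar_resolutions((1920, 1080), 4))
```

Because token count scales with pixel area, a per-side scale of 0.7 cuts an old frame's tokens to roughly half, which is where the overall FLOPs reduction would come from under this kind of schedule.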