🤖 AI Summary
This work addresses the inefficiency and spatial hallucination issues in high-resolution GUI agents caused by substantial spatiotemporal redundancy in screenshots and historical trajectories. To this end, the authors propose GUIPruner, a training-free visual token compression framework that, for the first time, identifies temporal misalignment and spatial topological conflicts in existing methods. GUIPruner introduces Temporal-Adaptive Resolution (TAR) and Stratified Structure-aware Pruning (SSP) to jointly preserve interactive foregrounds, semantic anchors, and global layout. Evaluated on Qwen2-VL-2B, GUIPruner reduces FLOPs by 3.4× and accelerates visual encoding by 3.3× while retaining over 94% of the original performance, significantly outperforming current compression approaches.
📄 Abstract
Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
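To make the TAR idea concrete, here is a minimal sketch of decay-based resizing for historical screenshots. The paper's actual decay schedule and parameters are not specified above, so the exponential decay, the `decay` rate, and the `min_scale` floor below are all illustrative assumptions, not GUIPruner's implementation.

```python
# Hypothetical sketch of Temporal-Adaptive Resolution (TAR):
# older screenshots in the trajectory get progressively lower
# resolution, mirroring the agent's "fading memory" attention.
# The exponential schedule and parameter values are assumptions.

def tar_resolutions(base_size, num_steps, decay=0.7, min_scale=0.1):
    """Return a (width, height) target per historical step, newest first.

    base_size : (w, h) of the current full-resolution screenshot.
    num_steps : number of frames in the history window.
    decay     : per-step resolution scale factor (assumed, not from the paper).
    min_scale : floor so very old frames keep a minimal footprint.
    """
    w, h = base_size
    sizes = []
    for age in range(num_steps):
        s = max(decay ** age, min_scale)
        sizes.append((max(1, round(w * s)), max(1, round(h * s))))
    return sizes

# Example: a 1920x1080 screen with a 4-step history.
# The newest frame keeps full resolution; older frames shrink,
# so their visual-token counts drop roughly quadratically with age.
print(tar_resolutions((1920, 1080), 4))
```

Because token count scales with pixel area, a per-side scale of 0.7 cuts an old frame's tokens to roughly half, which is where the overall FLOPs reduction would come from under this kind of schedule.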