GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

📅 2025-10-01
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the inference inefficiency and memory constraints of vision-language models (VLMs) processing long sequences of high-resolution GUI screenshots, this paper proposes GUI-KV, a plug-and-play, training-free KV cache compression method designed specifically for GUI agents. GUI-KV models GUI-specific spatial saliency and temporal redundancy through two techniques: saliency-guided pruning, which augments attention scores with the L2 norm of hidden states, and cross-frame key subspace projection, which preferentially prunes redundant history; a uniform budget allocation strategy distributes the cache budget across transformer layers. In the five-screenshot setting of AgentNetBench, GUI-KV reduces decoding FLOPs by 38.9% while improving step accuracy by 4.1% over the full-cache baseline, and it significantly outperforms existing general-purpose KV compression methods on GUI-centric sequential reasoning tasks, all without architectural modification or model retraining.

πŸ“ Abstract
Graphical user interface (GUI) agents built on vision-language models have emerged as a promising approach to automate human-computer workflows. However, they face an inefficiency challenge as they process long sequences of high-resolution screenshots and solve long-horizon tasks, making inference slow, costly, and memory-bound. While key-value (KV) caching can mitigate this, storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are sub-optimal as they do not account for the spatial and temporal redundancy of GUIs. In this work, we first analyze attention patterns in GUI agent workloads and find that, unlike in natural images, attention sparsity is uniformly high across all transformer layers. This insight motivates a simple uniform budget allocation strategy, which we show empirically outperforms more complex layer-varying schemes. Building on this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining. GUI-KV combines two novel techniques: (i) spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to better preserve semantically important visual tokens, and (ii) temporal redundancy scoring, which projects previous frames' keys onto the current frame's key subspace to preferentially prune redundant history. Across standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV compression baselines, closely matching full-cache accuracy at modest budgets. Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline. These results demonstrate that exploiting GUI-specific redundancies enables efficient and reliable agent performance.
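The spatial saliency guidance described above blends attention scores with the L2 norm of each token's hidden state before selecting which KV entries to keep. The paper does not publish the exact combination rule here, so the sketch below is a minimal NumPy illustration under assumed choices: the function names (`saliency_scores`, `prune_kv`), the min-max normalization, and the blending weight `alpha` are all hypothetical, not the authors' implementation.

```python
import numpy as np

def saliency_scores(attn_scores, hidden_states, alpha=0.5):
    """Blend accumulated attention with the L2 norm of each token's hidden state.

    alpha is a hypothetical mixing weight; both signals are rescaled to [0, 1]
    so neither dominates purely by magnitude.
    """
    norms = np.linalg.norm(hidden_states, axis=-1)       # per-token hidden-state energy
    attn = attn_scores / (attn_scores.max() + 1e-8)
    sal = norms / (norms.max() + 1e-8)
    return alpha * attn + (1 - alpha) * sal

def prune_kv(keys, values, scores, budget):
    """Keep the `budget` highest-scoring KV entries, preserving token order."""
    keep = np.sort(np.argsort(scores)[-budget:])
    return keys[keep], values[keep], keep
```

Under this sketch, a visually salient token (large hidden-state norm) can survive pruning even if it received little attention so far, which is the behavior the abstract attributes to spatial saliency guidance.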
Problem

Research questions and friction points this paper is trying to address.

Optimizing KV cache compression for GUI agent efficiency
Addressing spatial-temporal redundancy in GUI attention patterns
Reducing computational costs while maintaining agent accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uniform budget allocation strategy for cache compression
Spatial saliency guidance using the L2 norm of hidden states
Temporal redundancy scoring via key subspace projection
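The temporal redundancy idea in the last bullet can be sketched concretely: score each previous-frame key by how much of its energy lies in the current frame's key subspace, so that well-reconstructed (redundant) keys become pruning candidates. This is an assumed realization using an SVD basis; the function name `temporal_redundancy` and the low-rank truncation parameter `rank` are hypothetical, not taken from the paper.

```python
import numpy as np

def temporal_redundancy(prev_keys, curr_keys, rank=8):
    """Fraction of each previous key's energy captured by the current frame's key subspace.

    A score near 1 means the old key is nearly expressible in the current
    frame's subspace (redundant); near 0 means it carries novel information.
    """
    # Orthonormal basis from the top-`rank` right singular vectors of the current keys
    _, _, vt = np.linalg.svd(curr_keys, full_matrices=False)
    basis = vt[:rank]                      # (rank, d)
    proj = prev_keys @ basis.T @ basis     # projection onto the subspace
    ratio = np.linalg.norm(proj, axis=-1) / (np.linalg.norm(prev_keys, axis=-1) + 1e-8)
    return ratio ** 2
```

A pruning policy would then evict previous-frame entries with the highest redundancy scores first, which matches the stated goal of preferentially dropping history that the current screenshot already covers.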