AI Summary
This work addresses the challenge of context inflation and excessive computational overhead in GUI agents caused by redundant image tokens when processing visual histories, which severely limits their ability to model long interaction trajectories. The authors propose a learnable patch selector that dynamically removes redundant visual patches through cross-frame redundancy detection while preserving spatial layout via structure-aware token compression. Their analysis reveals, for the first time, that performance saturation in visual history modeling stems from inefficient token representations rather than inherent information redundancy. Built upon multimodal language models such as Qwen2.5-VL-7B, the method achieves efficient temporal modeling, reducing visual tokens by 46% on average across three benchmarks while improving task success rates by 3%, thereby significantly enhancing the agent's capacity to leverage extended interaction histories.
Abstract
Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. As a result, unlike in other domains, adding history has yielded little or no performance improvement. We address this inefficiency by introducing ReVision, which trains multimodal language models on trajectories from which redundant visual patches have been removed. A learned patch selector compares patch representations across consecutive screenshots and drops the redundant ones while preserving the spatial structure required by the model. Across three benchmarks (OSWorld, WebTailBench, and AgentNetBench), when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by approximately 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that, once redundancy is removed, performance continues to improve as more past observations are incorporated. This suggests that the commonly observed saturation in visual history is not due to the limited usefulness of past information, but is rather a consequence of inefficient token representations.
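To make the idea of cross-frame patch redundancy concrete, the following is a minimal sketch and not the ReVision method itself. The abstract describes a learned patch selector; this sketch swaps in a fixed cosine-similarity threshold as a stand-in. The names `keep_mask` and `prune_history`, the threshold value of 0.98, and the choice to carry each kept patch's original grid index (as one simple way to retain spatial position) are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Patch = Sequence[float]   # one patch embedding (hypothetical representation)
Frame = List[Patch]       # patches of one screenshot, in row-major grid order


def cosine_similarity(a: Patch, b: Patch) -> float:
    """Cosine similarity between two patch embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero patches count as identical; otherwise as unrelated.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def keep_mask(prev: Frame, curr: Frame, threshold: float = 0.98) -> List[bool]:
    """Mark a patch as kept when it differs from the same grid position
    in the previous frame. The fixed threshold is an assumption standing
    in for the learned selector described in the abstract."""
    if len(prev) != len(curr):
        raise ValueError("frames must have the same number of patches")
    return [cosine_similarity(p, c) < threshold for p, c in zip(prev, curr)]


def prune_history(
    frames: List[Frame], threshold: float = 0.98
) -> List[List[Tuple[int, Patch]]]:
    """Keep every patch of the first frame and only the changed patches of
    later frames. Each kept patch is paired with its original grid index so
    its spatial position is not lost."""
    pruned = []
    for t, frame in enumerate(frames):
        if t == 0:
            mask = [True] * len(frame)
        else:
            mask = keep_mask(frames[t - 1], frame, threshold)
        pruned.append(
            [(i, patch) for i, (patch, keep) in enumerate(zip(frame, mask)) if keep]
        )
    return pruned


if __name__ == "__main__":
    # Toy 2x2 patch grid: only patch 1 changes between the two screenshots.
    frame0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    frame1 = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
    pruned = prune_history([frame0, frame1])
    kept = sum(len(f) for f in pruned)
    total = len(frame0) + len(frame1)
    print(f"kept {kept} of {total} patches")
```

In this toy trajectory the second screenshot contributes only the single patch that changed, which is the kind of saving that lets more history fit in a fixed context budget. A learned selector, as described in the abstract, would replace the fixed threshold with a trained decision.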