🤖 AI Summary
GUI agents transitioning to pure vision-based paradigms face two key challenges that impede efficient context modeling: (1) element contexts that are highly dense yet only loosely related, and (2) severe redundancy in historical interactions. To address these, we propose SimpAgent, a context-aware simplification framework. First, it introduces a masking-based element pruning mechanism that suppresses interference from unrelated elements without explicitly modeling complex inter-element relationships. Second, it incorporates a consistency-guided history compression module that uses explicit guidance to strengthen the large vision-language model's implicit, compact encoding of historical interactions. SimpAgent adopts an end-to-end, purely visual architecture and reduces FLOPs by 27% while achieving superior navigation performance across diverse web and mobile benchmarks.
📝 Abstract
The research focus of GUI agents is shifting from text-dependent to pure-vision-based approaches, which, though promising, prioritize comprehensive pre-training data collection while neglecting contextual modeling challenges. We probe the characteristics of element and history contextual modeling in GUI agents and summarize two findings: 1) the high density and loose relations of the element context point to the existence of many unrelated elements and their negative influence; 2) the high redundancy of the history context reveals inefficient history modeling in current GUI agents. In this work, we propose a context-aware simplification framework for building an efficient and effective GUI agent, termed SimpAgent. To mitigate potential interference from numerous unrelated elements, we introduce a masking-based element pruning method that circumvents intractable relation modeling through an efficient masking mechanism. To reduce the redundancy in historical information, we devise a consistency-guided history compression module, which enhances implicit LLM-based compression through innovative explicit guidance, achieving an optimal balance between performance and efficiency. With the above components, SimpAgent reduces FLOPs by 27% and achieves superior GUI navigation performance. Comprehensive navigation experiments across diverse web and mobile environments demonstrate the effectiveness and potential of our agent.
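To make the idea of masking unrelated elements concrete, the following is a minimal, hypothetical sketch and not SimpAgent's actual implementation. The abstract does not say how unrelated elements are identified or whether masking operates on pixels or on visual tokens. The sketch assumes that element bounding boxes and a target box are already known, and it blanks out the pixels of every element that does not overlap the target. The names `boxes_overlap`, `mask_unrelated_elements`, and the `Box` layout are illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical box layout: (left, top, right, bottom), right/bottom exclusive.
Box = Tuple[int, int, int, int]


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True if the two axis-aligned boxes share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def mask_unrelated_elements(
    image: List[List[int]],
    element_boxes: List[Box],
    target_box: Box,
    fill: int = 0,
) -> List[List[int]]:
    """Return a copy of `image` with every element that does not overlap
    `target_box` overwritten by `fill`. The input image is left unchanged.

    Assumption: "unrelated" is approximated here as "does not overlap the
    target"; the paper may define relatedness differently.
    """
    masked = [row[:] for row in image]
    height = len(masked)
    width = len(masked[0]) if height else 0
    for left, top, right, bottom in element_boxes:
        if boxes_overlap((left, top, right, bottom), target_box):
            continue  # keep the target element and anything touching it
        for y in range(max(0, top), min(height, bottom)):
            for x in range(max(0, left), min(width, right)):
                masked[y][x] = fill
    return masked


# Toy 4x6 grayscale "screenshot" with two elements; the second is the target.
screenshot = [[1] * 6 for _ in range(4)]
elements = [(0, 0, 2, 2), (3, 1, 5, 3)]
masked_screenshot = mask_unrelated_elements(screenshot, elements, elements[1])
```

In this toy setup only the first element's 2x2 region is blanked, while the target region and the background stay intact. The design point it illustrates is that no pairwise relation between elements is modeled; each element is simply kept or masked.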