Less is More: Empowering GUI Agent with Context-Aware Simplification

📅 2025-07-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
GUI agents moving to purely vision-based paradigms face two challenges that impede efficient context modeling: (1) element contexts that are highly dense yet only loosely related, and (2) severe redundancy in historical interactions. To address these, we propose SimpAgent, a context-aware simplification framework. First, it introduces a masking-based element pruning mechanism that suppresses interference from unrelated visual elements without explicitly modeling complex inter-element relationships. Second, it incorporates a consistency-guided history compression module that adds explicit guidance to the otherwise implicit compression of historical interactions performed by the large vision-language model. SimpAgent adopts an end-to-end, purely visual architecture and achieves state-of-the-art performance across diverse web and mobile navigation benchmarks. It reduces FLOPs by 27% while improving accuracy and generalization, showing gains in effectiveness, efficiency, and robustness together.

📝 Abstract
The research focus of GUI agents is shifting from text-dependent to pure-vision-based approaches. Though promising, these approaches prioritize comprehensive pre-training data collection while neglecting the challenges of contextual modeling. We probe the characteristics of element and history contextual modeling in GUI agents and summarize two findings: 1) the high density and loose relations of the element context point to the existence of many unrelated elements and their negative influence; 2) the high redundancy of the history context reveals inefficient history modeling in current GUI agents. In this work, we propose a context-aware simplification framework for building an efficient and effective GUI agent, termed SimpAgent. To mitigate potential interference from numerous unrelated elements, we introduce a masking-based element pruning method that circumvents intractable relation modeling through an efficient masking mechanism. To reduce the redundancy in historical information, we devise a consistency-guided history compression module, which enhances implicit LLM-based compression through explicit guidance, achieving an optimal balance between performance and efficiency. With these components, SimpAgent reduces FLOPs by 27% and achieves superior GUI navigation performance. Comprehensive navigation experiments across diverse web and mobile environments demonstrate the effectiveness and potential of our agent.
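
The masking-based element pruning idea can be illustrated with a small sketch. The abstract does not say how regions are chosen or at which stage masking is applied, so everything below is an assumption made for illustration: the function name `mask_unrelated_elements`, the `mask_ratio` parameter, the `(left, top, right, bottom)` box format, and the rule of masking a random subset of elements that do not overlap the target. It is not the authors' implementation.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical box format: (left, top, right, bottom), right/bottom exclusive.
Box = Tuple[int, int, int, int]


def overlaps(a: Box, b: Box) -> bool:
    """Return True if two boxes share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def mask_unrelated_elements(
    image: List[List[int]],
    boxes: List[Box],
    target: Box,
    mask_ratio: float = 0.5,
    fill: int = 0,
    seed: Optional[int] = None,
) -> Tuple[List[List[int]], List[Box]]:
    """Blank out a random fraction of element boxes that do not touch the target.

    Assumption: no relation between elements is modeled; the only rule is
    that the target element must stay visible.
    """
    rng = random.Random(seed)
    candidates = [b for b in boxes if not overlaps(b, target)]
    k = int(len(candidates) * mask_ratio)
    chosen = rng.sample(candidates, k)

    masked = [row[:] for row in image]  # copy so the input stays untouched
    for left, top, right, bottom in chosen:
        for y in range(top, bottom):
            for x in range(left, right):
                masked[y][x] = fill
    return masked, chosen


if __name__ == "__main__":
    screenshot = [[1] * 6 for _ in range(4)]  # toy 4x6 grayscale "screenshot"
    elements = [(0, 0, 2, 2), (2, 0, 4, 2), (4, 2, 6, 4)]
    masked, chosen = mask_unrelated_elements(
        screenshot, elements, target=(2, 0, 4, 2), mask_ratio=1.0, seed=0
    )
    for row in masked:
        print(row)
```

The point of the sketch is that pruning needs only element locations and the target, not any model of how elements relate to each other, which is the property the abstract attributes to the masking mechanism.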

Problem

Research questions and friction points this paper is trying to address.

Addressing high-density unrelated elements in GUI context modeling
Reducing redundancy in GUI agent history context modeling
Improving efficiency and performance in GUI navigation tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Masking-based element pruning method
Consistency-guided history compression module
Context-aware simplification framework
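
As a rough illustration of what "consistency-guided" could mean for history compression, the sketch below compares the action distribution predicted from the full history with the one predicted from a compressed history, and penalizes disagreement with a KL divergence. The source gives no formula, so the choice of KL divergence, the two-branch setup, and the names `consistency_loss`, `full_history_logits`, and `compressed_history_logits` are all assumptions for illustration, not the paper's definition.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(
    full_history_logits: List[float],
    compressed_history_logits: List[float],
) -> float:
    """Hypothetical consistency term: the compressed-history branch is pushed
    to predict the same action distribution as the full-history branch."""
    p = softmax(full_history_logits)        # treated as the reference
    q = softmax(compressed_history_logits)  # branch being guided
    return kl_divergence(p, q)


if __name__ == "__main__":
    full = [2.0, 0.5, -1.0]        # toy action logits with full history
    compressed = [1.5, 0.7, -0.8]  # toy action logits with compressed history
    print(consistency_loss(full, compressed))
```

In a training setup of this kind, such a term would be added to the usual action-prediction loss so that dropping redundant history changes the model's predictions as little as possible.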
Gongwei Chen
Harbin Institute of Technology, Shenzhen
Xurui Zhou
Harbin Institute of Technology, Shenzhen
Rui Shao
Professor, Harbin Institute of Technology (Shenzhen)
Computer Vision, Multimodal LLM, Embodied AI
Yibo Lyu
Harbin Institute of Technology, Shenzhen
Kaiwen Zhou
Huawei Noah’s Ark Lab
Shuai Wang
Huawei Noah’s Ark Lab
Wentao Li
Huawei Noah’s Ark Lab
Yinchuan Li
Principal Researcher, Noah's Ark Lab
Generative Models, Embodied AI, Artificial Intelligence
Zhongang Qi
Huawei Noah’s Ark Lab
Liqiang Nie
Harbin Institute of Technology, Shenzhen