HiconAgent: History Context-aware Policy Optimization for GUI Agents

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
GUI agents struggle with sequential navigation tasks because feeding the full interaction history to the policy adds computational overhead and distracts it with irrelevant information. To address this, the paper proposes HCPO (History Context-aware Policy Optimization), a training framework that combines Dynamic Context Sampling (DCS) with Anchor-guided History Compression (AHC). DCS presents the agent with variable-length histories during sampling so that it learns to use the most relevant context. AHC applies a dual-branch policy update in which the compressed branch removes past observations while keeping past actions as information-flow anchors, and a history-enhanced alignment loss couples the compressed and uncompressed branches. HiconAgent-3B, trained with HCPO, outperforms the larger GUI-R1-7B on GUI-Odyssey by 8.46 percent in grounding accuracy and 11.32 percent in step success rate. On AndroidControl and AITW it achieves comparable results with up to a 2.47x computational speedup and a 60 percent reduction in FLOPs.

📝 Abstract
Graphical User Interface (GUI) agents require effective use of historical context to perform sequential navigation tasks. While incorporating past actions and observations can improve decision making, naive use of full history leads to excessive computational overhead and distraction from irrelevant information. To address this, we introduce HiconAgent, a GUI agent trained with History Context-aware Policy Optimization (HCPO) for efficient and effective utilization of historical information. HCPO optimizes history usage in both sampling and policy updates through two complementary components: (1) Dynamic Context Sampling (DCS) presents the agent with variable length histories during sampling, enabling adaptive use of the most relevant context; (2) Anchor-guided History Compression (AHC) refines the policy update phase with a dual branch strategy where the compressed branch removes history observations while keeping history actions as information flow anchors. The compressed and uncompressed branches are coupled through a history-enhanced alignment loss to enforce consistent history usage while maintaining efficiency. Experiments on mainstream GUI navigation benchmarks demonstrate strong performance. Despite being smaller, HiconAgent-3B outperforms GUI-R1-7B by +8.46 percent grounding accuracy and +11.32 percent step success rate on GUI-Odyssey, while achieving comparable results on AndroidControl and AITW with up to 2.47x computational speedup and 60 percent FLOPs reduction.
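The two history mechanisms described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Step` structure, the uniform rule in `sample_history_length`, and the string-based context format are hypothetical stand-ins. The sketch only shows the general idea of (1) drawing a variable history length and (2) building either an uncompressed context (observations and actions) or a compressed one (actions only).

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One past interaction step (hypothetical structure, not from the paper)."""
    observation: str  # stand-in for a screenshot; a real agent would hold image tokens
    action: str       # e.g. "CLICK(10, 20)"


def sample_history_length(max_len: int, rng: random.Random) -> int:
    """Draw a history length from 0..max_len inclusive (assumed uniform rule)."""
    return rng.randint(0, max_len)


def build_context(history: List[Step], keep: int, compress: bool) -> List[str]:
    """Return context items for the most recent `keep` steps.

    compress=False keeps observations and actions (uncompressed branch).
    compress=True drops past observations and keeps only the actions,
    which then act as anchors (compressed branch).
    """
    # history[-0:] would return the whole list, so handle keep == 0 explicitly.
    recent = history[-keep:] if keep > 0 else []
    context: List[str] = []
    for step in recent:
        if not compress:
            context.append(f"OBS:{step.observation}")
        context.append(f"ACT:{step.action}")
    return context


history = [
    Step("screen_0", "CLICK(10, 20)"),
    Step("screen_1", "TYPE('hello')"),
    Step("screen_2", "SCROLL(down)"),
]
rng = random.Random(0)
k = sample_history_length(len(history), rng)
full_context = build_context(history, k, compress=False)
compact_context = build_context(history, k, compress=True)
```

In this toy setting the compressed context always has half as many items as the uncompressed one, which hints at where the efficiency gain comes from: observations (screenshots) are typically far more expensive to encode than action strings.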
Problem

Research questions and friction points this paper is trying to address.

Naive use of the full interaction history imposes excessive computational overhead on GUI agents
Irrelevant historical information distracts the agent during sequential navigation tasks
History must be used efficiently without sacrificing performance on navigation benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

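Dynamic Context Sampling (DCS), which presents variable-length histories during sampling for adaptive history usage
Anchor-guided History Compression (AHC), a dual-branch strategy whose compressed branch drops history observations and keeps history actions as anchors
History-enhanced alignment loss that couples the compressed and uncompressed branches for consistent history usage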
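The source does not give the form of the history-enhanced alignment loss. As a rough illustration of what coupling two branches can look like, the sketch below penalizes disagreement between the action distributions of the uncompressed and compressed branches using a KL divergence. The KL term is a swapped-in stand-in chosen for illustration, not the paper's loss, and the function names and toy logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_penalty(full_logits: Sequence[float],
                      compressed_logits: Sequence[float]) -> float:
    """Assumed stand-in: KL from the uncompressed branch to the compressed branch."""
    p_full = softmax(full_logits)
    p_comp = softmax(compressed_logits)
    return kl_divergence(p_full, p_comp)


# Toy logits over three candidate actions (made-up numbers).
same = alignment_penalty([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = alignment_penalty([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The penalty is zero when both branches produce the same action distribution and grows as they diverge, which is the behavior any consistency term between the two branches would need.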
Authors

Xurui Zhou
Harbin Institute of Technology, Shenzhen
Gongwei Chen
Harbin Institute of Technology, Shenzhen
Yuquan Xie
Harbin Institute of Technology, Shenzhen
Zaijing Li
Harbin Institute of Technology, Shenzhen
Open-World Agent, Multimodal Large Language Model, Multimodal Sentiment Analysis
Kaiwen Zhou
Huawei Noah’s Ark Lab
Shuai Wang
Huawei Noah’s Ark Lab
Shuo Yang
Harbin Institute of Technology, Shenzhen
Zhuotao Tian
Professor, Harbin Institute of Technology (Shenzhen)
Vision-language Model, Multi-modal Perception, Computer Vision
Rui Shao
Professor, Harbin Institute of Technology (Shenzhen)
Computer Vision, Multimodal LLM, Embodied AI