Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control

📅 2025-10-16
🤖 AI Summary
Existing autonomous agents for mobile device control predominantly rely on end-to-end state-action mapping and lack structured reasoning and planning capabilities, leading to poor generalization on novel tasks and unseen UI layouts. To address this, we propose a hierarchical vision-language agent architecture that unifies high-level subgoal planning with low-level action execution. Specifically, we decompose long-horizon tasks into sequences of single-step subgoals optimized via a foresight advantage function, effectively mitigating path explosion. We further introduce an execution-feedback-driven joint training mechanism that eliminates the need for a separate critic module. Our approach integrates multimodal vision-language understanding, hierarchical reinforcement learning, and subgoal sequence modeling. Evaluated on Android-in-the-Wild, our method achieves an 87.9% task success rate, substantially surpassing the prior state of the art. It also demonstrates strong zero-shot transferability and cross-application generalization on ScreenSpot-v2 and AndroidWorld.


📝 Abstract
Building agents that autonomously operate mobile devices has attracted increasing attention. While Vision-Language Models (VLMs) show promise, most existing approaches rely on direct state-to-action mappings, which lack structured reasoning and planning, and thus generalize poorly to novel tasks or unseen UI layouts. We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control, featuring a high-level reasoning model and a low-level action model that are jointly optimized. For efficient training, we reformulate multi-step decision-making as a sequence of single-step subgoals and propose a foresight advantage function, which leverages execution feedback from the low-level model to guide high-level optimization. This design alleviates the path explosion issue encountered by Group Relative Policy Optimization (GRPO) in long-horizon tasks and enables stable, critic-free joint training. Hi-Agent achieves a new State-Of-The-Art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark, significantly outperforming prior methods across three paradigms: prompt-based (AppAgent: 17.7%), supervised (Filtered BC: 54.5%), and reinforcement learning-based (DigiRL: 71.9%). It also demonstrates competitive zero-shot generalization on the ScreenSpot-v2 benchmark. On the more challenging AndroidWorld benchmark, Hi-Agent also scales effectively with larger backbones, showing strong adaptability in high-complexity mobile control scenarios.
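The hierarchical design described in the abstract can be sketched as a plan-then-act loop: a high-level reasoning model emits one single-step subgoal at a time, a low-level action model grounds it into a UI action, and execution feedback flows back to the planner. The class and function names below are illustrative placeholders, not the paper's released code.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    subgoal: str      # high-level instruction, e.g. "open the Settings app"
    action: str       # grounded low-level UI action, e.g. a tap on an element
    success: bool     # execution feedback returned to the planner

@dataclass
class HiAgentSketch:
    """Toy plan-then-act loop: one subgoal per step, critic-free."""
    trajectory: list = field(default_factory=list)

    def plan(self, task: str, history: list) -> str:
        # Placeholder for the high-level VLM: propose the next subgoal.
        return f"subgoal {len(history) + 1} for: {task}"

    def act(self, subgoal: str, screen: str) -> tuple:
        # Placeholder for the low-level action model: ground the subgoal
        # against the current screen and report success/failure.
        return f"tap-element-for({subgoal})", True

    def run(self, task: str, screens: list) -> list:
        for screen in screens:
            subgoal = self.plan(task, self.trajectory)
            action, ok = self.act(subgoal, screen)
            self.trajectory.append(Step(subgoal, action, ok))
            if not ok:  # execution feedback gates the next planning step
                break
        return self.trajectory
```

The key structural point is that multi-step decision-making is flattened into a sequence of single-step subgoal choices, each evaluated by what the low-level model actually achieves.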
Problem

Research questions and friction points this paper is trying to address.

Developing hierarchical agents for autonomous mobile device control
Addressing poor generalization in vision-language action mappings
Solving path explosion in long-horizon mobile control tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical agent with joint optimization of reasoning and action models
Foresight advantage function using low-level execution feedback
Reformulates multi-step decisions as single-step subgoals
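A rough sketch of the critic-free update behind these ideas: for each state, sample several candidate subgoals, roll out the low-level model to obtain an execution-feedback reward for each, and normalize rewards within the group (GRPO-style) to get advantages. The normalization below is my paraphrase of the general group-relative scheme, not the paper's exact foresight advantage definition.

```python
import statistics

def foresight_advantages(rewards):
    """Group-relative advantages over candidate subgoals (GRPO-style):
    A_i = (r_i - mean(r)) / std(r), with std clamped to avoid
    division by zero. Rewards come from low-level execution feedback,
    so no separate learned critic is required."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]
```

Because each subgoal is scored by the low-level model's actual execution outcome, the high-level policy can be optimized without branching over full multi-step trajectories, which is what mitigates path explosion.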
Zhe Wu
Tsinghua University
Hongjin Lu
Tsinghua University
Junliang Xing
Tsinghua University
Changhao Zhang
Tsinghua University
Yin Zhu
Tsinghua University
Yuhao Yang
University of Hong Kong
Large Language Models, Agentic Models, Foundation Models, Graph Learning
Yuheng Jing
Institute of Automation, Chinese Academy of Sciences
Kai Li
Institute of Automation, Chinese Academy of Sciences
Kun Shao
Huawei
AI Agent, reinforcement learning, multi-agent systems, embodied AI, game AI
Jianye Hao
Huawei Noah's Ark Lab/Tianjin University
Multiagent Systems, Embodied AI
Jun Wang
University College London
Yuanchun Shi
Professor
Human-Computer Interaction