🤖 AI Summary
Existing mobile-device autonomous control agents predominantly rely on end-to-end state-action mapping, lacking structured reasoning and planning capabilities, which leads to poor generalization on novel tasks and unseen UI layouts. To address this, we propose a hierarchical vision-language agent architecture that unifies high-level subgoal planning with low-level action execution. Specifically, we decompose long-horizon tasks into optimized single-step subgoal sequences via a foresight advantage function, effectively mitigating path explosion. We further introduce an execution-feedback-driven joint training mechanism that eliminates the need for a separate critic module. Our approach integrates multimodal vision-language understanding, hierarchical reinforcement learning, and subgoal sequence modeling. Evaluated on Android-in-the-Wild, our method achieves an 87.9% task success rate, substantially surpassing the prior state of the art. Moreover, it demonstrates strong zero-shot transferability and cross-application generalization on ScreenSpot-v2 and AndroidWorld.
📝 Abstract
Building agents that autonomously operate mobile devices has attracted increasing attention. While Vision-Language Models (VLMs) show promise, most existing approaches rely on direct state-to-action mappings, which lack structured reasoning and planning, and thus generalize poorly to novel tasks or unseen UI layouts. We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control, featuring a high-level reasoning model and a low-level action model that are jointly optimized. For efficient training, we reformulate multi-step decision-making as a sequence of single-step subgoals and propose a foresight advantage function, which leverages execution feedback from the low-level model to guide high-level optimization. This design alleviates the path explosion issue encountered by Group Relative Policy Optimization (GRPO) in long-horizon tasks and enables stable, critic-free joint training. Hi-Agent achieves a new state-of-the-art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark, significantly outperforming prior methods across three paradigms: prompt-based (AppAgent: 17.7%), supervised (Filtered BC: 54.5%), and reinforcement learning-based (DigiRL: 71.9%). It also demonstrates competitive zero-shot generalization on the ScreenSpot-v2 benchmark. On the more challenging AndroidWorld benchmark, Hi-Agent scales effectively with larger backbones, showing strong adaptability in high-complexity mobile control scenarios.
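The abstract describes a critic-free, GRPO-style scheme in which the high-level model's subgoal proposals are scored by execution feedback from the low-level model, with advantages normalized within a sampled group. The paper's exact foresight advantage formulation is not given here; the sketch below only illustrates the general group-relative idea, and all function names (`execution_reward`, `foresight_advantages`) are hypothetical placeholders.

```python
import random
import statistics

def execution_reward(subgoal: str) -> float:
    # Stand-in for low-level execution feedback: in the real system this
    # would come from rolling out the action model on the device/emulator
    # and checking whether the subgoal was achieved. Here it is random.
    return random.random()

def foresight_advantages(subgoals: list[str]) -> list[float]:
    """Group-relative (GRPO-style) advantages for candidate subgoals.

    Each candidate is scored by execution feedback, and rewards are
    normalized within the group, so no separate learned critic is needed.
    """
    rewards = [execution_reward(g) for g in subgoals]
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]
```

Because advantages are centered within each sampled group, candidates better than the group average get positive weight in the policy update, which is what removes the need for a value baseline.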