🤖 AI Summary
Existing mobile agents are limited in complex, long-horizon tasks due to insufficient experience and underdeveloped skills. This work proposes K²-Agent, a hierarchical framework that, for the first time, decouples and co-evolves declarative knowledge (know-what) and procedural knowledge (know-how) to enable self-evolving high-level planning and autonomously generated low-level skills. The approach integrates an SRLR cycle (Summarize, Reflect, Locate, Revise) to refine high-level reasoning, and introduces Curriculum-guided Group Relative Policy Optimization (C-GRPO), augmented by decoupled rewards and dynamic demonstration injection, to train the low-level executor. Using only raw screenshots and off-the-shelf foundation models, K²-Agent achieves a 76.1% success rate on AndroidWorld and demonstrates strong cross-task generalization on ScreenSpot-v2 and AitW.
📝 Abstract
Existing mobile device control agents often perform poorly on complex tasks requiring long-horizon planning and precise operations, typically due to a lack of relevant task experience or unfamiliarity with skill execution. We propose K2-Agent, a hierarchical framework that models human-like cognition by separating and co-evolving declarative (knowing what) and procedural (knowing how) knowledge for planning and execution. K2-Agent's high-level reasoner is bootstrapped from a single demonstration per task and runs a Summarize-Reflect-Locate-Revise (SRLR) loop to distill and iteratively refine task-level declarative knowledge through self-evolution. The low-level executor is trained with our curriculum-guided Group Relative Policy Optimization (C-GRPO), which (i) constructs a balanced sample pool using decoupled reward signals and (ii) employs dynamic demonstration injection to guide the model in autonomously generating successful trajectories for training. On the challenging AndroidWorld benchmark, K2-Agent achieves a 76.1% success rate using only raw screenshots and open-source backbones. Furthermore, K2-Agent shows strong dual generalization: its high-level declarative knowledge transfers across diverse base models, while its low-level procedural skills achieve competitive performance on unseen tasks in ScreenSpot-v2 and Android-in-the-Wild (AitW).
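To make the SRLR self-evolution loop described above concrete, here is a minimal sketch of how such a cycle might be structured. This is purely illustrative: every class, function, and field name below is a hypothetical stand-in, not the paper's actual interface, and the simplistic reflect/locate heuristics stand in for the LLM-driven reasoning the abstract describes.

```python
# Hypothetical sketch of an SRLR (Summarize-Reflect-Locate-Revise) cycle.
# All names and heuristics are illustrative assumptions, not the paper's API.
from dataclasses import dataclass, field


@dataclass
class TaskKnowledge:
    """Declarative (know-what) notes maintained per task."""
    task: str
    notes: list = field(default_factory=list)


def summarize(trajectory):
    # Summarize: condense an execution trajectory into a short record.
    return f"{len(trajectory)} steps, last action: {trajectory[-1]}"


def reflect(success):
    # Reflect: a failed episode signals a gap in the task knowledge.
    return not success


def locate(trajectory):
    # Locate: pinpoint the faulty step (here, naively, the final action).
    return len(trajectory) - 1


def revise(knowledge, trajectory, step):
    # Revise: patch the declarative knowledge with a corrective note.
    knowledge.notes.append(f"avoid '{trajectory[step]}' at step {step}")


def srlr_cycle(knowledge, trajectory, success):
    summary = summarize(trajectory)
    if reflect(success):
        step = locate(trajectory)
        revise(knowledge, trajectory, step)
    return summary


kb = TaskKnowledge(task="open wifi settings")
srlr_cycle(kb, ["open settings", "tap bluetooth"], success=False)
print(kb.notes)  # a corrective note is appended after the failed attempt
```

In the actual framework, summarization, reflection, and revision would be performed by the high-level reasoner over real episode traces; this toy version only mirrors the control flow of the four-stage cycle.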