🤖 AI Summary
GUI agents face two key challenges in long-horizon tasks: (1) coupling between high-level planning and low-level execution leads to role ambiguity and responsibility conflicts; and (2) a lack of explicit task-state awareness results in progress loss. This paper proposes a staged execution-feedback reinforcement learning framework that decouples high-level scheduling from low-level execution via a Coordinator–Executor–State Tracker multi-agent architecture. The scheduler is designed to be generalizable and plug-and-play. Leveraging task decomposition, context compression, and state tracking, the Coordinator and State Tracker are jointly trained with execution-feedback RL, significantly improving long-horizon task completion rates and state consistency. Experiments demonstrate that the scheduler generalizes across diverse underlying executors and consistently enhances their performance.
📝 Abstract
The rapid development of large vision-language models (VLMs) has greatly advanced research on GUI agents. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level planning capabilities with low-level execution capabilities, facing prevalent issues of responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Rather than training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. On this basis, we build the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model, assisting the Executor in solving long-horizon tasks through task scheduling and state management. Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Furthermore, analysis confirms that our trained high-level scheduling module is a generalizable, plug-and-play component that significantly enhances the long-horizon capabilities of various Executors. Code is available at https://github.com/hehehahi4/CES.
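To make the division of roles concrete, the abstract's Coordinator/Executor/State Tracker loop can be sketched as follows. This is a minimal illustrative sketch with hypothetical interfaces (the class and method names are assumptions, not the authors' implementation); in CES the Coordinator and State Tracker would be trained scheduling models and the Executor any low-level GUI action model.

```python
# Hypothetical sketch of the CES control loop: the Coordinator plans from a
# compressed state summary, the Executor acts, and the State Tracker compresses
# the interaction history so long-horizon progress is not lost.

class Coordinator:
    """High-level scheduler: decomposes the goal into the next subtask."""
    def plan(self, goal: str, state_summary: str) -> str:
        # Assumption: a trained model would condition on goal + compressed state.
        return f"subtask for '{goal}' given state '{state_summary}'"

class Executor:
    """Low-level actor: carries out one subtask via GUI actions."""
    def execute(self, subtask: str) -> str:
        return f"executed {subtask}"

class StateTracker:
    """Compresses the interaction history into a compact task state."""
    def __init__(self):
        self.summary = "start"
    def update(self, subtask: str, result: str) -> str:
        # Assumption: a trained model would summarize, not just concatenate.
        self.summary = f"last step: {subtask} -> {result}"
        return self.summary

def run_ces(goal: str, max_steps: int = 3):
    coordinator, executor, tracker = Coordinator(), Executor(), StateTracker()
    trace = []
    for _ in range(max_steps):
        subtask = coordinator.plan(goal, tracker.summary)   # schedule
        result = executor.execute(subtask)                  # act
        tracker.update(subtask, result)                     # compress state
        trace.append((subtask, result))
    return trace
```

Because the Executor only sees one subtask at a time and the Coordinator only sees a compressed summary, any Executor model can be swapped in, which is the plug-and-play property the abstract claims.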