Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation

📅 2025-11-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
GUI agents face two key challenges in long-horizon tasks: (1) coupling between high-level planning and low-level execution leads to role ambiguity and responsibility conflicts; and (2) a lack of explicit task-state awareness results in progress loss. This paper proposes a staged execution-feedback reinforcement learning framework that decouples high-level scheduling from low-level execution via a Coordinator–Executor–State Tracker multi-agent architecture, with a scheduler designed to be generalizable and plug-and-play. Leveraging task decomposition, context compression, and state tracking, the Coordinator and State Tracker are jointly trained with execution-feedback RL, significantly improving long-horizon task completion rates and state consistency. Experiments demonstrate that the scheduler generalizes across diverse underlying Executors and consistently enhances their performance.

📝 Abstract
The rapid development of large vision-language models (VLMs) has greatly advanced research on GUI agents. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level planning and low-level execution capabilities, facing prevalent issues of responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Unlike training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. Based on this, we build the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model, assisting the Executor in solving long-horizon tasks through task scheduling and state management. Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Furthermore, analysis confirms that our trained high-level scheduling module is a generalizable, plug-and-play module that significantly enhances the long-horizon capabilities of various Executors. Code is available at https://github.com/hehehahi4/CES.
Problem

Research questions and friction points this paper is trying to address.

GUI agents struggle with long-horizon tasks due to capability conflicts.
Agents lack task-state awareness, causing progress loss in complex workflows.
Single-agent models suffer responsibility coupling between high-level planning and low-level execution.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Staged execution-feedback reinforcement learning trains high-level scheduling models
Coordinator-State Tracker multi-agent framework enables task decomposition and state management
Plug-and-play high-level module enhances various low-level Executors for long-horizon tasks
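To make the division of labor concrete, here is a minimal sketch of the CES control loop as described in the abstract: a Coordinator decomposes the task, an Executor carries out each subtask, and a State Tracker compresses execution feedback into a compact state summary. All class and method names are illustrative assumptions, not the authors' actual API (see the linked repository for the real implementation).

```python
# Hypothetical sketch of the Coordinator-Executor-State Tracker (CES) loop.
# In the paper, the Coordinator and State Tracker are trained VLM-based
# agents; here they are stubbed out to show the data flow only.

class Coordinator:
    """High-level scheduler: decomposes a long-horizon task into subtasks."""
    def plan(self, task, state_summary):
        # A real implementation would prompt a trained scheduling model,
        # conditioned on the current state summary.
        return [f"{task} :: step {i}" for i in range(1, 4)]

class Executor:
    """Low-level GUI actor: performs one subtask and reports feedback."""
    def execute(self, subtask):
        return {"subtask": subtask, "success": True}

class StateTracker:
    """Compresses execution history into a compact task-state summary."""
    def __init__(self):
        self.history = []

    def update(self, feedback):
        self.history.append(feedback)
        done = sum(f["success"] for f in self.history)
        return f"{done}/{len(self.history)} subtasks completed"

def run_ces(task):
    coordinator, executor, tracker = Coordinator(), Executor(), StateTracker()
    summary = "no progress yet"
    for subtask in coordinator.plan(task, summary):
        feedback = executor.execute(subtask)   # low-level execution
        summary = tracker.update(feedback)     # context compression / state update
    return summary

print(run_ces("book a flight"))  # → 3/3 subtasks completed
```

Because the Executor is only touched through `execute`, any low-level GUI model can be dropped in behind that interface, which is the plug-and-play property the paper highlights.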
Zehao Deng
School of Computer Science and Technology, Soochow University
Tianjie Ju
Shanghai Jiao Tong University
Natural Language Processing
Zheng Wu
School of Computer Science, Shanghai Jiao Tong University
Zhuosheng Zhang
Assistant Professor at Shanghai Jiao Tong University
Natural Language Processing · Large Language Models · Reasoning · AI Safety · Multi-Agent Learning
Gongshen Liu
School of Computer Science, Shanghai Jiao Tong University