🤖 AI Summary
GUI agents face two key challenges in long-horizon tasks: (1) coupling between high-level planning and low-level execution leads to role ambiguity and responsibility conflicts; and (2) a lack of explicit task-state awareness results in progress loss. This paper proposes a staged execution-feedback reinforcement learning framework that decouples high-level scheduling from low-level execution via a Coordinator–Executor–State Tracker multi-agent architecture. The scheduler is designed to be generalizable and plug-and-play. Leveraging task decomposition, context compression, and state tracking, the Coordinator and State Tracker are jointly trained with execution-feedback RL, significantly improving long-horizon task completion rates and state consistency. Experiments demonstrate that the scheduler generalizes across diverse underlying executors and consistently enhances their performance.
📝 Abstract
The rapid development of large vision-language models (VLMs) has greatly advanced research on GUI agents. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level planning capabilities with low-level execution capabilities, facing prevalent issues of responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Rather than training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. On this basis, we build the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model, assisting the Executor in solving long-horizon tasks through task scheduling and state management. Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Furthermore, analysis confirms that our trained high-level scheduling module is a generalizable, plug-and-play component that significantly enhances the long-horizon capabilities of various Executors. Code is available at https://github.com/hehehahi4/CES.
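To make the division of roles concrete, the abstract's Coordinator/Executor/State Tracker loop can be sketched as follows. This is a minimal illustrative sketch with hypothetical interfaces (the class and method names are assumptions, not the authors' implementation); in CES the Coordinator and State Tracker would be trained scheduling models and the Executor any low-level GUI action model.

```python
# Hypothetical sketch of the CES control loop: the Coordinator plans from a
# compressed state summary, the Executor acts, and the State Tracker compresses
# the interaction history so long-horizon progress is not lost.

class Coordinator:
    """High-level scheduler: decomposes the goal into the next subtask."""
    def plan(self, goal: str, state_summary: str) -> str:
        # Assumption: a trained model would condition on goal + compressed state.
        return f"subtask for '{goal}' given state '{state_summary}'"

class Executor:
    """Low-level actor: carries out one subtask via GUI actions."""
    def execute(self, subtask: str) -> str:
        return f"executed {subtask}"

class StateTracker:
    """Compresses the interaction history into a compact task state."""
    def __init__(self):
        self.summary = "start"
    def update(self, subtask: str, result: str) -> str:
        # Assumption: a trained model would summarize, not just concatenate.
        self.summary = f"last step: {subtask} -> {result}"
        return self.summary

def run_ces(goal: str, max_steps: int = 3):
    coordinator, executor, tracker = Coordinator(), Executor(), StateTracker()
    trace = []
    for _ in range(max_steps):
        subtask = coordinator.plan(goal, tracker.summary)   # schedule
        result = executor.execute(subtask)                  # act
        tracker.update(subtask, result)                     # compress state
        trace.append((subtask, result))
    return trace
```

Because the Executor only sees one subtask at a time and the Coordinator only sees a compressed summary, any Executor model can be swapped in, which is the plug-and-play property the abstract claims.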