🤖 AI Summary
Existing single-agent approaches are limited in modeling the structural semantics of GUIs, while multi-agent reinforcement learning (MARL) enables capability decomposition but suffers from inefficient training and poor compatibility with large vision-language models (LVLMs), hindering reliable natural language-to-GUI action mapping. To address these challenges, we propose SWIRL, a staged interleaved reinforcement learning framework that decomposes multi-agent collaboration into sequential single-agent tasks, improving training stability and efficiency while providing theoretical guarantees of stepwise safety, monotonic policy improvement, and reward convergence. SWIRL employs an LVLM-based Navigator-Interactor architecture to decouple high-level semantic planning from low-level action execution. Evaluated on both GUI control and multi-agent mathematical reasoning benchmarks, SWIRL achieves state-of-the-art performance, demonstrating superior training efficiency and strong generalization across diverse interactive tasks.
📝 Abstract
The rapid advancement of large vision-language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by training inefficiency and remains poorly compatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL as a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. Applied to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.
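The core training scheme described above, reformulating MARL as interleaved single-agent stages, can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the scalar "policies" for the Navigator and Interactor, the quadratic joint reward, and the hill-climbing stand-in for a single-agent RL update are all assumptions made for the sake of a runnable example.

```python
import random

def joint_reward(params):
    # Toy stand-in for episode return: maximized when the two agents'
    # scalar policies coordinate at 1.0 and -1.0 respectively.
    return -((params["navigator"] - 1.0) ** 2
             + (params["interactor"] + 1.0) ** 2)

def single_agent_stage(params, agent, steps=200, step_size=0.1, seed=0):
    # One SWIRL stage: perturb only `agent`'s parameter while the other
    # agent stays frozen, accepting changes that raise the joint return
    # (a crude proxy for a single-agent RL update).
    rng = random.Random(seed)
    for _ in range(steps):
        candidate = dict(params)
        candidate[agent] += rng.uniform(-step_size, step_size)
        if joint_reward(candidate) > joint_reward(params):
            params = candidate
    return params

def swirl_train(rounds=3):
    params = {"navigator": 0.0, "interactor": 0.0}
    for r in range(rounds):
        # Interleaved stages: update the Navigator, then the Interactor,
        # one at a time across rounds.
        for agent in ("navigator", "interactor"):
            params = single_agent_stage(params, agent, seed=r)
    return params

params = swirl_train()
print(params, joint_reward(params))
```

Because each stage only ever accepts parameter changes that increase the joint return while the other agent is held fixed, the return is non-decreasing across stages, mirroring (in toy form) the monotonic improvement property the abstract claims for SWIRL.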