🤖 AI Summary
Existing single-agent approaches are limited in modeling the structural semantics of GUIs, while multi-agent reinforcement learning (MARL) enables capability decomposition but suffers from inefficient training and poor compatibility with large vision-language models (LVLMs), hindering reliable natural language-to-GUI action mapping. To address these challenges, we propose SWIRL, a staged interleaved reinforcement learning framework that decomposes multi-agent collaboration into sequential single-agent tasks, improving training stability and efficiency while providing theoretical guarantees of stepwise safety, monotonic policy improvement, and reward convergence. SWIRL employs an LVLM-based Navigator-Interactor architecture to decouple high-level semantic planning from low-level action execution. Evaluated on both GUI control and multi-agent mathematical reasoning benchmarks, SWIRL achieves state-of-the-art performance, demonstrating superior training efficiency and strong generalization across diverse interactive tasks.
📝 Abstract
The rapid advancement of large vision-language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by training inefficiency and remains poorly compatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL as a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. Applied to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.
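The core training scheme described above, reformulating MARL as interleaved single-agent stages, can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the scalar "policies" for the Navigator and Interactor, the quadratic joint reward, and the hill-climbing stand-in for a single-agent RL update are all assumptions made for the sake of a runnable example.

```python
import random

def joint_reward(params):
    # Toy stand-in for episode return: maximized when the two agents'
    # scalar policies coordinate at 1.0 and -1.0 respectively.
    return -((params["navigator"] - 1.0) ** 2
             + (params["interactor"] + 1.0) ** 2)

def single_agent_stage(params, agent, steps=200, step_size=0.1, seed=0):
    # One SWIRL stage: perturb only `agent`'s parameter while the other
    # agent stays frozen, accepting changes that raise the joint return
    # (a crude proxy for a single-agent RL update).
    rng = random.Random(seed)
    for _ in range(steps):
        candidate = dict(params)
        candidate[agent] += rng.uniform(-step_size, step_size)
        if joint_reward(candidate) > joint_reward(params):
            params = candidate
    return params

def swirl_train(rounds=3):
    params = {"navigator": 0.0, "interactor": 0.0}
    for r in range(rounds):
        # Interleaved stages: update the Navigator, then the Interactor,
        # one at a time across rounds.
        for agent in ("navigator", "interactor"):
            params = single_agent_stage(params, agent, seed=r)
    return params

params = swirl_train()
print(params, joint_reward(params))
```

Because each stage only ever accepts parameter changes that increase the joint return while the other agent is held fixed, the return is non-decreasing across stages, mirroring (in toy form) the monotonic improvement property the abstract claims for SWIRL.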