🤖 AI Summary
This work addresses the embodied competitive task of 3v3 multi-UAV volleyball, which poses challenges including long-horizon dependencies, strong inter-agent coupling, and underactuated quadrotor dynamics. We propose the Hierarchical Co-Self-Play (HCSP) framework, introducing a novel three-stage population training pipeline that jointly optimizes high-level strategic behaviors, such as emergent role switching and coordinated formations, and low-level agile control, all learned from scratch. HCSP adopts a "centralized high-level planning + decentralized low-level execution" architecture, integrating hierarchical reinforcement learning with multi-agent self-play. In simulation, HCSP significantly outperforms non-hierarchical self-play and rule-based baselines with an average win rate of 82.9%, and wins 71.5% of matches against a two-stage ablation variant. Furthermore, the policy is successfully deployed on real quadrotor platforms. This work establishes a scalable paradigm for competitive, embodied multi-agent intelligence.
📝 Abstract
In this paper, we tackle the problem of learning to play 3v3 multi-drone volleyball, a new embodied competitive task that requires both high-level strategic coordination and low-level agile control. The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors. To address this, we propose Hierarchical Co-Self-Play (HCSP), a hierarchical reinforcement learning framework that separates centralized high-level strategic decision-making from decentralized low-level motion control. We design a three-stage population-based training pipeline to enable both strategy and skill to emerge from scratch without expert demonstrations: (I) training diverse low-level skills, (II) learning high-level strategy via self-play with fixed low-level controllers, and (III) joint fine-tuning through co-self-play. Experiments show that HCSP achieves superior performance, outperforming non-hierarchical self-play and rule-based hierarchical baselines with an average 82.9% win rate and a 71.5% win rate against the two-stage variant. Moreover, co-self-play leads to emergent team behaviors such as role switching and coordinated formations, demonstrating the effectiveness of our hierarchical design and training scheme.
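The three-stage pipeline described above can be sketched in code. This is a minimal structural illustration only, assuming nothing beyond what the abstract states; all class and function names (`Policy`, `train_low_level_skills`, etc.) are hypothetical, and the numeric "updates" are placeholders for actual reinforcement learning, not the paper's method.

```python
import random

class Policy:
    """Stand-in for a trainable policy; real weights are abstracted to a float."""
    def __init__(self, name):
        self.name = name
        self.skill = 0.0
        self.frozen = False

    def update(self, delta):
        # Placeholder for a gradient step; frozen policies are not updated.
        if not self.frozen:
            self.skill += delta

def train_low_level_skills(n_skills):
    # Stage I: train a diverse population of low-level controllers from scratch.
    skills = [Policy(f"skill_{i}") for i in range(n_skills)]
    for p in skills:
        p.update(random.uniform(0.5, 1.0))
    return skills

def self_play_high_level(low_level, iters):
    # Stage II: freeze the low-level controllers and learn a centralized
    # high-level strategy via self-play on top of them.
    for p in low_level:
        p.frozen = True
    strategy = Policy("high_level")
    for _ in range(iters):
        strategy.update(0.1)  # placeholder self-play update
    return strategy

def co_self_play(strategy, low_level, iters):
    # Stage III: unfreeze everything and jointly fine-tune strategy and
    # skills through co-self-play.
    for p in low_level:
        p.frozen = False
    for _ in range(iters):
        strategy.update(0.05)
        for p in low_level:
            p.update(0.01)
    return strategy, low_level

skills = train_low_level_skills(n_skills=3)
strategy = self_play_high_level(skills, iters=10)
strategy, skills = co_self_play(strategy, skills, iters=10)
```

The key structural point is the freeze/unfreeze boundary: stage II holds the low-level skills fixed so the high-level strategy can stabilize, and stage III relaxes that constraint for joint fine-tuning.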