🤖 AI Summary
This work addresses system-level challenges in large-scale multi-agent reinforcement learning (MARL), including synchronization barriers between rollout and training, rollout load imbalance, and low training resource utilization, by proposing FlexMARL, the first end-to-end training framework for large-scale LLM-based MARL. FlexMARL disaggregates rollout from training and introduces a micro-batch-driven asynchronous pipeline, hierarchical load balancing, agent-centric resource allocation, and location-agnostic communication. A joint orchestrator is co-designed to enable efficient, coordinated optimization across the system. Evaluated on a large-scale production cluster, FlexMARL achieves up to a 7.3× speedup and up to 5.6× higher hardware utilization compared with existing frameworks.
📝 Abstract
Despite algorithm-level innovations for multi-agent reinforcement learning (MARL), the underlying networked infrastructure for large-scale MARL training remains underexplored. Existing training frameworks primarily optimize for single-agent scenarios and fail to address the unique system-level challenges of MARL, including rollout-training synchronization barriers, rollout load imbalance, and training resource underutilization. To bridge this gap, we propose FlexMARL, the first end-to-end training framework that holistically optimizes rollout, training, and their orchestration for large-scale LLM-based MARL. Specifically, FlexMARL introduces a joint orchestrator to manage data flow under the rollout-training disaggregated architecture. Building upon the experience store, a novel micro-batch-driven asynchronous pipeline eliminates the synchronization barriers while providing strong consistency guarantees. The rollout engine adopts a parallel sampling scheme combined with hierarchical load balancing, which adapts to skewed inter-agent and intra-agent request patterns. The training engine achieves on-demand hardware binding through agent-centric resource allocation. The training states of different agents are swapped via unified, location-agnostic communication. Empirical results on a large-scale production cluster demonstrate that FlexMARL achieves up to a 7.3x speedup and improves hardware utilization by up to 5.6x compared to existing frameworks.
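To make the idea of a micro-batch-driven asynchronous pipeline concrete, the following is a minimal sketch, not FlexMARL's implementation. It assumes a hypothetical in-memory `ExperienceStore` that groups rollout samples into fixed-size micro-batches per agent, so a trainer thread can start consuming data before any rollout has finished. All names (`ExperienceStore`, `rollout_worker`, `trainer`, `run_pipeline`) and the micro-batch size are illustrative assumptions, and the sketch does not model the consistency guarantees, load balancing, or resource allocation described in the abstract.

```python
import queue
import threading


class ExperienceStore:
    """Hypothetical buffer: collects samples per agent and releases
    them to the trainer one micro-batch at a time."""

    def __init__(self, micro_batch_size):
        self.micro_batch_size = micro_batch_size
        self._pending = {}            # agent_id -> partially filled micro-batch
        self._ready = queue.Queue()   # completed micro-batches awaiting training
        self._lock = threading.Lock()

    def put(self, agent_id, sample):
        with self._lock:
            batch = self._pending.setdefault(agent_id, [])
            batch.append(sample)
            if len(batch) == self.micro_batch_size:
                # A full micro-batch becomes trainable immediately.
                self._ready.put((agent_id, batch))
                self._pending[agent_id] = []

    def close(self):
        """Flush partial micro-batches and signal the end of the stream."""
        with self._lock:
            for agent_id, batch in self._pending.items():
                if batch:
                    self._ready.put((agent_id, batch))
            self._pending.clear()
        self._ready.put(None)  # sentinel for a single consumer

    def get(self):
        return self._ready.get()


def rollout_worker(store, agent_id, num_samples):
    # Stand-in for LLM sampling: emit placeholder trajectories.
    for step in range(num_samples):
        store.put(agent_id, {"agent": agent_id, "step": step})


def trainer(store, trained):
    # Consume micro-batches as they arrive instead of waiting
    # for every rollout to complete.
    while True:
        item = store.get()
        if item is None:
            break
        agent_id, batch = item
        trained[agent_id] = trained.get(agent_id, 0) + len(batch)


def run_pipeline(samples_per_agent, micro_batch_size=4):
    store = ExperienceStore(micro_batch_size)
    trained = {}
    train_thread = threading.Thread(target=trainer, args=(store, trained))
    train_thread.start()
    workers = [
        threading.Thread(target=rollout_worker, args=(store, agent, n))
        for agent, n in samples_per_agent.items()
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    store.close()
    train_thread.join()
    return trained


if __name__ == "__main__":
    # Skewed load: one agent produces far more samples than the other.
    print(run_pipeline({"planner": 10, "coder": 3}))
```

In this sketch the trainer never waits on a global barrier: an agent with a heavy rollout load only delays its own remaining micro-batches, while samples already grouped are trained on right away.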