Rollout-Training Co-Design for Efficient LLM-Based Multi-Agent Reinforcement Learning

📅 2026-02-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses system-level challenges in large-scale LLM-based multi-agent reinforcement learning (MARL), including synchronization barriers between rollout and training, rollout load imbalance, and training resource underutilization, by proposing FlexMARL, which the authors describe as the first end-to-end training framework for LLM-based MARL. FlexMARL disaggregates rollout from training and introduces a micro-batch-driven asynchronous pipeline, hierarchical load balancing, agent-centric resource allocation, and location-agnostic communication. A joint orchestrator is co-designed with these components to coordinate optimization across the system. Evaluated on a large-scale production cluster, FlexMARL achieves up to 7.3× speedup and up to 5.6× higher hardware utilization compared to existing frameworks.
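To make "hierarchical load balancing" concrete, here is a minimal two-level sketch. It is not FlexMARL's algorithm, which this summary does not specify. It assumes one plausible reading: an inter-agent level that splits rollout replicas across agents in proportion to pending requests, and an intra-agent level that sends each request to the least-loaded replica. The function names, the agent names `planner` and `coder`, and the cost values are all hypothetical.

```python
def allocate_replicas(pending, total_replicas):
    """Inter-agent level (assumed policy): every agent gets one replica,
    then each spare replica goes to the agent with the most pending
    requests per replica it already holds."""
    agents = list(pending)
    if total_replicas < len(agents):
        raise ValueError("need at least one replica per agent")
    alloc = {agent: 1 for agent in agents}
    for _ in range(total_replicas - len(agents)):
        busiest = max(agents, key=lambda a: pending[a] / alloc[a])
        alloc[busiest] += 1
    return alloc


def dispatch(request_costs, num_replicas):
    """Intra-agent level (assumed policy): greedily assign each request
    to the replica with the smallest accumulated cost so far."""
    loads = [0] * num_replicas
    assignment = []
    for cost in request_costs:
        idx = loads.index(min(loads))  # first least-loaded replica
        loads[idx] += cost
        assignment.append(idx)
    return assignment, loads


# Hypothetical skew: one agent receives 9x the requests of the other.
alloc = allocate_replicas({"planner": 90, "coder": 10}, total_replicas=4)
assignment, loads = dispatch([5, 1, 1, 1], num_replicas=2)
```

With this skewed input the busier agent ends up with three of the four replicas, and the single long request is isolated on one replica while the short ones share the other.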

📝 Abstract

Despite algorithm-level innovations for multi-agent reinforcement learning (MARL), the underlying networked infrastructure for large-scale MARL training remains underexplored. Existing training frameworks primarily optimize for single-agent scenarios and fail to address the unique system-level challenges of MARL, including rollout-training synchronization barriers, rollout load imbalance, and training resource underutilization. To bridge this gap, we propose FlexMARL, the first end-to-end training framework that holistically optimizes rollout, training, and their orchestration for large-scale LLM-based MARL. Specifically, FlexMARL introduces a joint orchestrator to manage data flow under the rollout-training disaggregated architecture. Building upon an experience store, a novel micro-batch-driven asynchronous pipeline eliminates the synchronization barriers while providing strong consistency guarantees. The rollout engine adopts a parallel sampling scheme combined with hierarchical load balancing, which adapts to skewed inter- and intra-agent request patterns. The training engine achieves on-demand hardware binding through agent-centric resource allocation. The training states of different agents are swapped via unified, location-agnostic communication. Empirical results on a large-scale production cluster demonstrate that FlexMARL achieves up to 7.3x speedup and improves hardware utilization by up to 5.6x compared to existing frameworks.
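The micro-batch-driven pipeline described above can be illustrated with a toy producer/consumer sketch. It is an assumption-laden illustration, not the paper's implementation. A `queue.Queue` stands in for the experience store, one thread stands in for the rollout engine, and another for the training engine. The name `run_pipeline`, its parameters, and the sample format are hypothetical, and the sketch does not model the consistency guarantees the abstract mentions.

```python
import queue
import threading


def run_pipeline(num_samples=8, micro_batch_size=2):
    """Rollout pushes samples as they finish; the trainer consumes them
    in micro-batches without waiting for the whole rollout to end."""
    store = queue.Queue()   # stand-in for the experience store
    done = object()         # sentinel: rollout has finished
    consumed = []           # micro-batches in the order the trainer saw them

    def rollout_worker():
        for i in range(num_samples):
            store.put({"sample_id": i})
        store.put(done)

    def trainer():
        batch = []
        while True:
            item = store.get()  # blocks only until the next sample arrives
            if item is done:
                break
            batch.append(item["sample_id"])
            if len(batch) == micro_batch_size:
                consumed.append(batch)  # a training step would run here
                batch = []
        if batch:
            consumed.append(batch)      # flush the final partial micro-batch

    threads = [
        threading.Thread(target=rollout_worker),
        threading.Thread(target=trainer),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed


batches = run_pipeline(num_samples=5, micro_batch_size=2)
```

The point of the design is that the trainer can begin work once `micro_batch_size` samples exist, instead of stalling at a barrier until every rollout completes.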

Problem

Research questions and friction points this paper is trying to address.

multi-agent reinforcement learning

rollout-training synchronization

load imbalance

resource underutilization

large-scale training

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent reinforcement learning

LLM-based MARL

rollout-training co-design