Rollout-Training Co-Design for Efficient LLM-Based Multi-Agent Reinforcement Learning

📅 2026-02-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses critical system-level challenges in large-scale multi-agent reinforcement learning (MARL), including synchronization bottlenecks between rollout and training, load imbalance, and low resource utilization, by proposing FlexMARL, the first end-to-end training framework for LLM-based MARL. FlexMARL decouples the rollout and training components and introduces a micro-batch-driven asynchronous pipeline, hierarchical load balancing, agent-centric resource allocation, and location-agnostic communication. A joint orchestrator is co-designed to enable efficient coordinated optimization across the system. Evaluated on a large-scale production cluster, FlexMARL achieves up to 7.3x faster training throughput and up to 5.6x higher hardware utilization than existing baselines.
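The micro-batch-driven asynchronous pipeline the summary describes can be illustrated with a minimal producer-consumer sketch. All names here (`ExperienceStore`, `rollout_worker`, `trainer`) are hypothetical stand-ins, not FlexMARL's actual API: rollout workers push micro-batches into a bounded experience store, and the trainer consumes them as they arrive, so neither side blocks on a full-batch synchronization barrier.

```python
import queue
import threading

class ExperienceStore:
    """Bounded buffer decoupling rollout from training (illustrative sketch)."""
    def __init__(self, capacity=8):
        self.buffer = queue.Queue(maxsize=capacity)

    def put(self, micro_batch):
        self.buffer.put(micro_batch)   # blocks only when the store is full

    def get(self):
        return self.buffer.get()       # blocks only when the store is empty

def rollout_worker(store, num_micro_batches):
    # Produce micro-batches at the rollout engine's own pace.
    for step in range(num_micro_batches):
        store.put({"step": step, "trajectories": [f"traj-{step}"]})
    store.put(None)                    # sentinel: rollout finished

def trainer(store, results):
    # Consume micro-batches as soon as they arrive; no global barrier.
    while (mb := store.get()) is not None:
        results.append(mb["step"])     # stand-in for a gradient update

store, results = ExperienceStore(), []
p = threading.Thread(target=rollout_worker, args=(store, 4))
c = threading.Thread(target=trainer, args=(store, results))
p.start(); c.start(); p.join(); c.join()
print(results)  # [0, 1, 2, 3]
```

The FIFO queue preserves micro-batch order, which hints at how an asynchronous pipeline can still offer consistency guarantees; the real system's protocol is not detailed in this summary.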

📝 Abstract
Despite algorithm-level innovations for multi-agent reinforcement learning (MARL), the underlying networked infrastructure for large-scale MARL training remains underexplored. Existing training frameworks primarily optimize for single-agent scenarios and fail to address the unique system-level challenges of MARL, including rollout-training synchronization barriers, rollout load imbalance, and training resource underutilization. To bridge this gap, we propose FlexMARL, the first end-to-end training framework that holistically optimizes rollout, training, and their orchestration for large-scale LLM-based MARL. Specifically, FlexMARL introduces a joint orchestrator to manage data flow under the rollout-training disaggregated architecture. Building upon the experience store, a novel micro-batch-driven asynchronous pipeline eliminates the synchronization barriers while providing strong consistency guarantees. The rollout engine adopts a parallel sampling scheme combined with hierarchical load balancing, which adapts to skewed inter/intra-agent request patterns. The training engine achieves on-demand hardware binding through agent-centric resource allocation. The training states of different agents are swapped via unified, location-agnostic communication. Empirical results on a large-scale production cluster demonstrate that FlexMARL achieves up to 7.3x speedup and improves hardware utilization by up to 5.6x compared to existing frameworks.
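The hierarchical load balancing mentioned in the abstract can be sketched as two-level routing under a skewed workload. This is an illustrative assumption of what "inter/intra-agent" balancing could look like, with hypothetical names (`Replica`, `route`), not FlexMARL's published policy: level one partitions requests by target agent, level two picks the least-loaded replica within that agent's pool.

```python
class Replica:
    def __init__(self, name):
        self.name, self.inflight = name, 0

def route(request, agent_pools):
    """Two-level routing: pick the agent's pool, then its least-loaded replica."""
    # Level 1 (inter-agent): requests are partitioned by the agent they target.
    pool = agent_pools[request["agent"]]
    # Level 2 (intra-agent): the least-inflight replica absorbs skewed bursts.
    replica = min(pool, key=lambda r: r.inflight)
    replica.inflight += 1
    return replica

# Skewed workload: agent "planner" receives 3x the traffic of "critic".
pools = {
    "planner": [Replica("p0"), Replica("p1")],
    "critic":  [Replica("c0")],
}
requests = [{"agent": "planner"}] * 6 + [{"agent": "critic"}] * 2
assignments = [route(r, pools).name for r in requests]
print(assignments)  # planner traffic alternates across p0/p1; critic traffic goes to c0
```

Least-inflight selection within each pool keeps hot agents from overloading a single replica, while the per-agent partition keeps a bursty agent from starving the others; the real system would also need to rebalance pool sizes as skew shifts.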
Problem

Research questions and friction points this paper is trying to address.

multi-agent reinforcement learning
rollout-training synchronization
load imbalance
resource underutilization
large-scale training
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent reinforcement learning
LLM-based MARL
rollout-training co-design
asynchronous pipeline
hierarchical load balancing
👥 Authors
Zhida Jiang (JD.com)
Zhaolong Xing (JD.com)
Jiawei Lu (JD.com)
Yipei Niu (Huawei)
Qingyuan Sang (JD.com)
Liangxu Zhang (JD.com)
Wenquan Dai (Huawei)
Junhua Shu (JD.com)
Jiaxing Wang (JD.com)
Qiangyu Pei (Huawei)
Qiong Chen (Huawei)
Xinyu Liu (Huawei)
Fangming Liu (Professor, School of Computer Science & Technology, Huazhong University of Science & Technology; research areas: AI & Cloud Computing, Datacenter, LLM System, Edge Computing, Green Computing)
Ai Han (JD.com)
Zhen Chen (JD.com)
Ke Zhang (JD.com)