RL in the Wild: Characterizing RLVR Training in LLM Deployment

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement Learning with Verifiable Rewards (RLVR) training of large language models suffers from systemic bottlenecks, including GPU underutilization, low parallel efficiency, inefficient data management, and load imbalance, which arise from complex data flows and heterogeneous tasks. Method: This work presents the first system-level workload characterization study for RLVR and introduces PolyTrace, a benchmark suite that integrates production deployment logs, sequence-length modeling, and dynamic task monitoring to accurately capture RLVR workload characteristics. Contribution/Results: PolyTrace reproduces production RLVR workloads with 94.7% accuracy and pinpoints four fundamental system bottlenecks, establishing the first reproducible, verifiable benchmark for RLVR training systems and enabling principled performance optimization and architecture design.

📝 Abstract
Large Language Models (LLMs) are now widely used across many domains. With their rapid development, Reinforcement Learning with Verifiable Rewards (RLVR) has surged in recent months as a way to enhance their reasoning and understanding abilities. However, its complex data flows and diverse tasks pose substantial challenges to RL training systems, and there is limited understanding of RLVR from a systems perspective. To thoroughly understand the system challenges introduced by RLVR, we present a characterization study of RLVR tasks in our LLM deployment. Specifically, we investigate the distribution and variation trends of workloads across different RL tasks and training steps. We identify issues such as GPU idling caused by skewed sequence-length distributions, inefficient parallel strategies under dynamically varying workloads, inefficient data management mechanisms, and load imbalance. We describe our observations and call for further investigation into the remaining open challenges. Furthermore, we propose the PolyTrace benchmark suite to enable evaluation with realistic workloads; a practical use case validates that PolyTrace achieves 94.7% accuracy.
Problem

Research questions and friction points this paper is trying to address.

Characterizing system challenges in RLVR training for LLM deployment
Identifying GPU inefficiencies from skewed sequence length distributions
Addressing load imbalance and inefficient data management mechanisms
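The GPU-idling problem above stems from synchronous rollout generation: every slot in a batch is held until the longest response finishes. A toy sketch of the effect (the Pareto length distribution, batch size, and numbers are illustrative assumptions, not values from the paper):

```python
import random

random.seed(0)

# Assumed long-tailed response lengths (in tokens) for one rollout batch;
# a handful of very long generations dominate wall-clock time.
lengths = [int(random.paretovariate(1.5) * 100) for _ in range(64)]

# Under synchronous generation, useful work is sum(lengths), but the
# time paid is max(lengths) for each of the batch's slots.
useful_tokens = sum(lengths)
paid_token_slots = max(lengths) * len(lengths)
utilization = useful_tokens / paid_token_slots

median = sorted(lengths)[len(lengths) // 2]
print(f"max/median length ratio: {max(lengths) / median:.1f}x")
print(f"effective decode utilization: {utilization:.1%}")
```

The heavier the tail of the length distribution, the lower the effective utilization, which is why skewed sequence lengths show up as GPU idle time.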
Innovation

Methods, ideas, or system contributions that make the work stand out.

Characterizing RLVR training workloads in LLM deployment
Identifying GPU idling and load imbalance issues
Proposing PolyTrace benchmark suite for evaluation
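For the load-imbalance observation, one standard mitigation (not necessarily the paper's) is longest-processing-time (LPT) scheduling: sort work items by cost and assign each to the currently least-loaded worker. A minimal sketch with made-up per-sample workloads:

```python
import heapq

# Hypothetical per-sample decode costs (illustrative values only).
work = [900, 120, 80, 850, 60, 200, 150, 700]

def lpt_assign(work, n_workers):
    """Greedy LPT: give each item, largest first, to the least-loaded worker."""
    heap = [(0, w) for w in range(n_workers)]  # (load, worker_id)
    heapq.heapify(heap)
    loads = [0] * n_workers
    for item in sorted(work, reverse=True):
        load, w = heapq.heappop(heap)
        loads[w] = load + item
        heapq.heappush(heap, (loads[w], w))
    return loads

naive = [sum(work[i::4]) for i in range(4)]  # round-robin split across 4 workers
balanced = lpt_assign(work, 4)
print("round-robin makespan:", max(naive))
print("LPT makespan:", max(balanced))
```

With these assumed costs, round-robin leaves one worker with both 850 and 700 (makespan 1550), while LPT brings the makespan down to 900, the cost of the single largest item.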
Jiecheng Zhou
USTC
Qinghao Hu
NTU
Yuyang Jin
Tsinghua University
Zerui Wang
USTC, Shanghai Jiao Tong University
Peng Sun
Unaffiliated
Yuzhe Gu
Shanghai Jiao Tong University
Large Language Model · Scalable Oversight · Knowledge and Reasoning
Wenwei Zhang
Shanghai AI Laboratory
Large Language Model · Scalable Oversight · Artificial Intelligence
Mingshu Zhai
Tsinghua University
Systems · Machine Learning
Xingcheng Zhang
USTC
Weiming Zhang
USTC