RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs

📅 2025-12-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the challenge of scaling reinforcement learning (RL) for large language model (LLM) inference in massive distributed environments, this paper proposes RLAXβ€”a scalable RL framework designed for TPU clusters. RLAX employs a parameter-server architecture with preemptible training and dynamic fault recovery. It integrates three key innovations: efficient synthetic data construction, a multi-algorithm-compatible distributed RL training pipeline, and fine-grained model weight synchronization. Evaluated on 1,024 v5p TPUs, RLAX improves the pass@8 accuracy of QwQ-32B by 12.8% in just 12 hours and 48 minutes, achieving significantly accelerated convergence and high training robustness. RLAX delivers a system-level solution for efficient, stable, and large-scale RL-based alignment of LLMs.

πŸ“ Abstract
Reinforcement learning (RL) has emerged as the de facto paradigm for improving the reasoning capabilities of large language models (LLMs). We have developed RLAX, a scalable RL framework on TPUs. RLAX employs a parameter-server architecture: a master trainer periodically pushes updated model weights to the parameter server, while a fleet of inference workers pulls the latest weights and generates new rollouts. We introduce a suite of system techniques to enable scalable and preemptible RL for a diverse set of state-of-the-art RL algorithms. To accelerate convergence and improve model quality, we have devised new dataset curation and alignment techniques. Large-scale evaluations show that RLAX improves QwQ-32B's pass@8 accuracy by 12.8% in just 12 hours 48 minutes on 1024 v5p TPUs, while remaining robust to preemptions during training.
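The push/pull protocol described in the abstract can be sketched in miniature. This is a hypothetical illustration, not RLAX's implementation: the real system shards weights across TPU hosts and synchronizes at fine granularity, whereas here a single in-memory store with a version counter shows how a trainer publishes checkpoints and inference workers fetch them only when a newer version exists.

```python
import threading


class ParameterServer:
    """Toy stand-in for a parameter server in a trainer/worker RL loop.

    The trainer calls push() after each update step; rollout workers call
    pull() with the version they last saw and receive new weights only if
    the server holds something fresher.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._weights = {}
        self._version = 0

    def push(self, weights):
        # Master trainer publishes a new checkpoint atomically.
        with self._lock:
            self._weights = dict(weights)
            self._version += 1
            return self._version

    def pull(self, since_version=-1):
        # Inference worker fetches weights only if a newer version exists;
        # otherwise it keeps generating rollouts with its current copy.
        with self._lock:
            if self._version > since_version:
                return self._version, dict(self._weights)
            return since_version, None


ps = ParameterServer()
ps.push({"layer0": 0.1})                        # trainer publishes v1
version, weights = ps.pull(since_version=-1)    # worker picks up v1
_, nothing = ps.pull(since_version=version)     # no newer weights yet
```

Decoupling the trainer from the workers through a versioned store is what makes the rollout fleet preemptible: a restarted worker simply pulls the latest version and resumes, without coordinating with the trainer.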
Problem

Research questions and friction points this paper is trying to address.

Develop scalable RL framework for large language models
Enable efficient distributed training on TPU hardware
Improve model accuracy with new data techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed parameter-server architecture on TPUs
Scalable preemptible RL for diverse algorithms
Dataset curation and alignment techniques
👥 Authors
Runlong Zhou (Apple)
Lefan Zhang (Apple)
Shang-Chen Wu (Apple)
Kelvin Zou (Apple)
Hanzhi Zhou (Apple)
Ke Ye (Apple)
Yihao Feng (Apple AIML): Machine Learning, Reinforcement Learning
Dong Yin (Apple)
Alex Guillen Garcia (Apple)
Dmytro Babych (Apple)
Rohit Chatterjee (Apple)
Matthew Hopkins (Apple)
Xiang Kong (Carnegie Mellon University): natural language processing, deep learning
Chang Lan (Apple AIML): Machine Learning, Distributed Systems
Lezhi Li (Apple)
Yiping Ma (UPenn, UC Berkeley): security, cryptography, systems
Daniele Molinari (Apple)
Senyu Tong (Apple)
Yanchao Sun (Apple AI/ML): foundation models, machine learning, reinforcement learning
Thomas Voice (Apple)
Jianyu Wang (Apple)
Chong Wang (Apple)
Simon Wang (Apple)
Floris Weers (Apple)
Yechen Xu (University of Washington)