Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

📅 2024-10-23
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the trade-off between tolerance to off-policy data and learning efficiency in Reinforcement Learning from Human Feedback (RLHF). The authors propose separating sample generation from policy updates: new samples are generated asynchronously while the model simultaneously trains on samples from previous policy iterations. Among the RLHF algorithms tested, online Direct Preference Optimization (DPO) proves most robust to off-policy data, and this robustness increases with policy-model scale. Experiments demonstrate significant speedups: instruction tuning of LLaMA 3.1 8B runs ~40% faster, and GSM8k finetuning of Rho 1B runs ~70% faster, while matching the final performance of synchronous training. The approach substantially improves computational efficiency and training scalability without compromising performance.

📝 Abstract
The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model which give a worse training signal. We tackle the fundamental challenge in this regime: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we test, online DPO is found to be most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. We verify the scalability of asynchronous RLHF by training a general-purpose chatbot from LLaMA 3.1 8B on an instruction-following task ~40% faster than a synchronous run while matching final performance. Finally, we extend our results to math and reasoning to demonstrate asynchronous RL can finetune Rho 1B on GSM8k ~70% faster while matching synchronous accuracy.
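The separation of generation and learning described above can be sketched as a producer-consumer loop: one thread keeps sampling from whatever policy weights are currently available while the learner consumes those samples and updates the policy, so the learner always trains on data that is at least one policy version stale. This is a minimal illustrative sketch, not the paper's implementation; all names (`generate`, `train_step`, `NUM_UPDATES`) are hypothetical stand-ins.

```python
import queue
import threading
import time

def generate(policy_version):
    # Stand-in for LLM sampling; records which policy version produced the batch.
    time.sleep(0.001)
    return [f"sample-from-v{policy_version}"], policy_version

def train_step(samples):
    # Stand-in for one online DPO update on (possibly off-policy) samples.
    return len(samples)

sample_queue = queue.Queue(maxsize=4)
current_version = 0
NUM_UPDATES = 8

def generator():
    # Producer: samples from the latest available policy while the learner
    # updates it concurrently -- this is what makes the data off-policy.
    while True:
        samples, version = generate(current_version)
        sample_queue.put((samples, version))
        if current_version >= NUM_UPDATES:
            break

def learner():
    # Consumer: trains on queued samples, tracking how stale each batch is
    # (current policy version minus the version that generated it).
    global current_version
    staleness = []
    for _ in range(NUM_UPDATES):
        samples, version = sample_queue.get()
        train_step(samples)
        current_version += 1
        staleness.append(current_version - version)
    return staleness

threading.Thread(target=generator, daemon=True).start()
staleness = learner()
print(staleness)  # every batch lags the learner by at least one policy version
```

In a synchronous run, generation would block until each update finishes (staleness fixed at one, but generation hardware idles during learning); the asynchronous variant keeps both sides busy at the cost of training on older samples, which is exactly the trade-off the paper studies.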
Problem

Research questions and friction points this paper is trying to address.

Improves the computational efficiency of RLHF by running generation and learning asynchronously.
Explores how much off-policy data training can tolerate to speed up learning without losing performance.
Demonstrates faster training of language models on instruction-following and math tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous generation and learning separation
Online but off-policy RLHF for efficiency
Robustness to off-policy data increases with scale
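The online DPO objective the paper finds most robust takes the standard DPO form; the off-policy aspect is that the preference pair is sampled from an earlier snapshot of the policy, written here as $\pi_{\theta_{\mathrm{old}}}$ (the subscript is our annotation for the stale generator, not notation from the paper):

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \pi_{\theta_{\mathrm{old}}}}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

Here $y_w$ and $y_l$ are the preferred and rejected completions, $\pi_{\mathrm{ref}}$ is the frozen reference policy, $\beta$ controls the strength of the KL-style regularization, and $\sigma$ is the logistic function. Synchronous training corresponds to $\pi_{\theta_{\mathrm{old}}} = \pi_\theta$; the paper's finding is that the loss degrades gracefully as the gap between the two grows, increasingly so at larger model scales.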