Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In current LLM post-training, on-policy reinforcement learning struggles to incorporate experience replay, which limits distributed exploration and scalability. To address this, the paper proposes Trajectory Balance with Asynchrony (TBA), the first framework to combine the diversity-seeking Trajectory Balance (TB) objective, introduced for GFlowNets, with asynchronous off-policy data streaming, fully decoupling search from policy updates. TBA pairs distributed off-policy actors with a central training node and a replay sampling scheme that draws on both reward magnitude and trajectory recency, enabling efficient training in large-scale, sparse-reward settings. Empirically, TBA outperforms strong baselines on mathematical reasoning, preference alignment, and automated red-teaming tasks, achieving a wall-clock speedup of 4× or more while improving both generation diversity and task performance.

📝 Abstract
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA uses a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, speeding up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.
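The abstract describes a central replay buffer that off-policy actors fill continuously, with the training node drawing batches "based on reward or recency." The paper does not publish this component's interface, so the following is only a minimal sketch of one plausible implementation; the class name, `p_reward` mixing parameter, and top-quartile cutoff are illustrative assumptions, not the authors' design:

```python
import random
from collections import deque

class ReplayBuffer:
    """Sketch of a replay buffer whose sampling mixes reward-ranked
    and recency-ranked draws (a hypothetical reading of TBA's buffer)."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def add(self, trajectory, reward):
        # Called asynchronously by distributed actor nodes.
        self.buffer.append((trajectory, reward))

    def sample(self, batch_size, p_reward=0.5):
        """With probability p_reward draw from the highest-reward
        trajectories, otherwise from the most recent ones."""
        by_reward = sorted(self.buffer, key=lambda tr: tr[1], reverse=True)
        by_recency = list(self.buffer)[::-1]  # newest first
        top_k = max(1, len(self.buffer) // 4)  # assumed top-quartile pool
        batch = []
        for _ in range(batch_size):
            pool = by_reward if random.random() < p_reward else by_recency
            batch.append(random.choice(pool[:top_k]))
        return batch
```

In this reading, the training node simply calls `sample()` at its own pace, so policy updates never wait on generation, which is the decoupling the abstract credits for the 4× wall-clock speedup.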
Problem

Research questions and friction points this paper is trying to address.

Decoupling exploration and learning for scalable LLM post-training
Enhancing diversity through large-scale off-policy sampling
Improving speed and performance in sparse reward settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples training and search for efficiency
Uses large-scale off-policy sampling for diversity
Applies Trajectory Balance for scalable RL
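The TB objective named above comes from the GFlowNet literature. For autoregressive generation, each sequence has a single construction path, so the backward-policy term is 1 and the trajectory balance condition reduces to matching log Z plus the policy's sequence log-probability against the log reward. A minimal sketch of the per-trajectory loss under that standard assumption (`log_z` is a learned scalar in practice; this is not the paper's exact implementation):

```python
import math

def trajectory_balance_loss(log_z, token_logprobs, reward, eps=1e-8):
    """Squared trajectory-balance residual for one sampled sequence.

    With a unique autoregressive path per sequence, the TB condition is
        log Z + sum_t log pi(a_t | s_t) = log R(x).
    Driving the squared residual to zero makes the policy sample
    sequences in proportion to reward, encouraging diverse generations.
    """
    log_pf = sum(token_logprobs)          # log P_F(x) under the policy
    log_r = math.log(max(reward, eps))    # clamp to handle sparse zero rewards
    residual = log_z + log_pf - log_r
    return residual * residual
```

Because the loss compares a stored trajectory's log-probability against the current policy, it remains well-defined on off-policy replay data, which is what lets TBA learn from the asynchronous buffer.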
🔎 Similar Papers
2024-08-10 · AAAI Conference on Artificial Intelligence · Citations: 30