🤖 AI Summary
In current LLM post-training, on-policy reinforcement learning is difficult to combine with experience replay, which limits distributed exploration and scalability. To address this, we propose Trajectory Balance with Asynchrony (TBA), a framework that pairs the diversity-seeking Trajectory Balance (TB) objective, originally introduced for GFlowNets, with asynchronous off-policy data streaming, fully decoupling search from policy updates. TBA uses an asynchronous searcher-trainer architecture: distributed searchers continually populate a central replay buffer, and a training node samples from that buffer based on reward or recency to update the policy. This enables efficient training at large scale and in sparse-reward settings. Empirically, TBA outperforms strong baselines on mathematical reasoning, preference tuning, and automated red-teaming, delivering a 4x-or-greater wall-clock speedup while improving both generation diversity and task performance.
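To make the TB objective concrete, the sketch below is a minimal PyTorch rendering of the standard trajectory balance loss as applied to an autoregressive LLM policy, assuming each sequence has a single generation path so the GFlowNet backward-policy term vanishes. The class and argument names are illustrative and not taken from the paper's implementation.

```python
# Minimal sketch (not the paper's code): Trajectory Balance loss for an autoregressive LLM.
# TB drives log Z + log pi_theta(x) toward log r(x) for every sampled trajectory x,
# which is what gives the objective its diversity-seeking, reward-proportional behavior.
import torch
import torch.nn as nn

class TrajectoryBalanceLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # log Z is a learned scalar estimating the log partition function of the reward.
        self.log_z = nn.Parameter(torch.zeros(1))

    def forward(self, token_logprobs: torch.Tensor, log_reward: torch.Tensor) -> torch.Tensor:
        """
        token_logprobs: (batch, seq_len) log pi_theta(token_t | prefix) under the current
                        policy, with padding positions set to 0.
        log_reward:     (batch,) log r(x) for each sampled sequence x.
        """
        seq_logprob = token_logprobs.sum(dim=-1)           # log pi_theta(x)
        residual = self.log_z + seq_logprob - log_reward   # per-trajectory TB residual
        return (residual ** 2).mean()                       # squared trajectory-balance loss
```

Because the loss only needs per-token log-probabilities and a scalar reward for each trajectory, it can be evaluated on off-policy data drawn from a replay buffer, which is what TBA exploits.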
📝 Abstract
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA uses a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, speeding up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.
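To illustrate the buffer mechanics described above, here is a minimal Python sketch of a central replay buffer that asynchronous searchers append to and a trainer samples from by reward or by recency. All names, the eviction rule, and the candidate-pool sizes are hypothetical assumptions for illustration, not details of the TBA system.

```python
# Hypothetical sketch of a central replay buffer sampled by reward or recency.
import random
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    tokens: list
    log_reward: float
    step_added: int      # global step at which a searcher produced this trajectory

@dataclass
class ReplayBuffer:
    capacity: int = 100_000
    items: list = field(default_factory=list)

    def add(self, traj: Trajectory) -> None:
        # Searcher nodes call this asynchronously; oldest data is evicted first.
        self.items.append(traj)
        if len(self.items) > self.capacity:
            self.items.pop(0)

    def sample(self, batch_size: int, mode: str = "reward") -> list:
        pool_size = max(batch_size * 4, batch_size)
        if mode == "reward":
            # Prioritize high-reward trajectories, useful for sparse-reward search.
            pool = sorted(self.items, key=lambda t: t.log_reward, reverse=True)[:pool_size]
        else:
            # Prioritize recent trajectories to limit off-policy staleness.
            pool = self.items[-pool_size:]
        return random.sample(pool, min(batch_size, len(pool)))
```

In this arrangement the trainer never blocks on generation: it draws whatever the searchers have already produced, which is the source of the decoupling and wall-clock speedup the abstract refers to.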