🤖 AI Summary
This work addresses the policy-lag problem in asynchronous online off-policy reinforcement learning, where discrepancies between the behavior policy and the learning policy arise from distributed training and high update frequencies. The study is the first to systematically distinguish two distinct sources of such lag and proposes MethodAcronym, a constrained policy optimization approach based on total variation advantage alignment. By aligning the advantage function under a total variation metric and incorporating a tailored filtering mechanism, the method improves algorithmic stability, sample efficiency, and final performance. Empirical results demonstrate strong robustness to policy lag on both classical reinforcement learning benchmarks and a large language model–based mathematical reasoning task.
📝 Abstract
Distributed training and increasing the gradient update frequency are practical strategies to accelerate learning and improve performance, but both exacerbate a central challenge: \textit{policy lag}, the mismatch between the behavior policy generating data and the learning policy being updated. Policy lag can hinder the scaling of on-policy learning algorithms to larger problems. In this paper, we identify the sources of policy lag caused by distributed learning and high update frequency. We use these findings to propose \textit{total Variation-based Advantage aligned Constrained policy Optimization (\methodacronym)} as a practical approach to mitigate policy lag. We empirically validate our method and show that it offers better robustness to policy lag on classic RL tasks and a modern RL-for-LLM math reasoning task.
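The abstract does not spell out the objective, so as a rough illustration of the general idea, here is a minimal sketch of a policy-gradient update that filters out lagged samples by the total variation distance between the learner and behavior policies. Everything here is an assumption for illustration: the function names, the per-state TV threshold `delta`, and the importance-weighted loss form are not taken from the paper.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between discrete distributions, 0.5 * sum |p - q|."""
    return 0.5 * np.abs(p - q).sum(axis=-1)

def filtered_pg_loss(learner_probs, behavior_probs, actions, advantages, delta=0.1):
    """Illustrative sketch (not the paper's objective): mask out samples whose
    per-state TV distance between the learner and the (possibly lagged)
    behavior policy exceeds `delta`, so stale data cannot dominate the update.

    learner_probs, behavior_probs: (batch, num_actions) action distributions
    actions: (batch,) sampled action indices; advantages: (batch,) estimates
    """
    tv = tv_distance(learner_probs, behavior_probs)   # (batch,) lag measure
    keep = tv <= delta                                # TV-based filter mask
    if not keep.any():
        return 0.0
    idx = np.arange(len(actions))
    # Importance-weighted policy-gradient loss on the surviving samples.
    ratio = learner_probs[idx, actions] / behavior_probs[idx, actions]
    loss = -(ratio * advantages * np.log(learner_probs[idx, actions]))
    return float(loss[keep].mean())
```

The sketch only captures the filtering intuition; the paper's constrained optimization formulation and advantage-alignment details would replace this simple mask and loss.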