🤖 AI Summary
This work addresses the policy-lag problem in asynchronous online off-policy reinforcement learning, where discrepancies between the behavior policy and the learning policy arise from distributed training and high update frequencies. The study is the first to systematically distinguish two distinct sources of such lag and proposes MethodAcronym, a constrained policy optimization approach based on total variation advantage alignment. By aligning the advantage function under a total variation metric and incorporating a tailored filtering mechanism, the method improves algorithmic stability, sample efficiency, and final performance. Empirical results demonstrate strong robustness to policy lag on both classical reinforcement learning benchmarks and a large language model–based mathematical reasoning task.
📝 Abstract
Distributed training and increasing the gradient update frequency are practical strategies to accelerate learning and improve performance, but both exacerbate a central challenge: \textit{policy lag}, the mismatch between the behavior policy generating data and the learning policy being updated. Policy lag can hinder the scaling of on-policy learning algorithms to larger problems. In this paper, we identify the sources of policy lag caused by distributed learning and high update frequency. We use these findings to propose \textit{total Variation-based Advantage aligned Constrained policy Optimization (\methodacronym)} as a practical approach to mitigate policy lag. We empirically validate our method and show that it offers better robustness to policy lag on classic RL tasks and a modern RL-for-LLM math reasoning task.
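The abstract does not spell out the objective, so as a rough illustration of the general idea, here is a minimal sketch of a policy-gradient update that filters out lagged samples by the total variation distance between the learner and behavior policies. Everything here is an assumption for illustration: the function names, the per-state TV threshold `delta`, and the importance-weighted loss form are not taken from the paper.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between discrete distributions, 0.5 * sum |p - q|."""
    return 0.5 * np.abs(p - q).sum(axis=-1)

def filtered_pg_loss(learner_probs, behavior_probs, actions, advantages, delta=0.1):
    """Illustrative sketch (not the paper's objective): mask out samples whose
    per-state TV distance between the learner and the (possibly lagged)
    behavior policy exceeds `delta`, so stale data cannot dominate the update.

    learner_probs, behavior_probs: (batch, num_actions) action distributions
    actions: (batch,) sampled action indices; advantages: (batch,) estimates
    """
    tv = tv_distance(learner_probs, behavior_probs)   # (batch,) lag measure
    keep = tv <= delta                                # TV-based filter mask
    if not keep.any():
        return 0.0
    idx = np.arange(len(actions))
    # Importance-weighted policy-gradient loss on the surviving samples.
    ratio = learner_probs[idx, actions] / behavior_probs[idx, actions]
    loss = -(ratio * advantages * np.log(learner_probs[idx, actions]))
    return float(loss[keep].mean())
```

The sketch only captures the filtering intuition; the paper's constrained optimization formulation and advantage-alignment details would replace this simple mask and loss.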