Near-Future Policy Optimization

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses the challenge of finding off-policy trajectories that are both high-quality and distributionally close to the current policy, so that the effective learning signal Q/V in reinforcement learning is as large as possible. We propose Near-Future Policy Optimization (NPO), which, for the first time, uses slightly later checkpoints from the policy's own training run as auxiliary behavior sources, improving trajectory quality while preserving distributional closeness. We also introduce an adaptive variant, AutoNPO, which automatically selects both the intervention timing and the guiding checkpoint. Integrated with the GRPO algorithm and combined with off-policy replay and Q/V estimation, NPO improves average performance on Qwen3-VL-8B-Instruct from 57.88 to 62.84, and AutoNPO reaches 63.15, accelerating convergence and raising the performance ceiling.


📝 Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality). Neither simultaneously satisfies the "strong enough" (higher $Q$, more new knowledge to learn) and "close enough" (lower $V$, more readily absorbed) conditions required to maximize the effective learning signal $\mathcal{S} = Q/V$. We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes $\mathcal{S}$. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.
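
The checkpoint-selection idea in the abstract (pick the guide that maximizes $\mathcal{S} = Q/V$) can be illustrated with a minimal sketch. The abstract does not say how $Q$ and $V$ are estimated, so the `Candidate` fields, the `learning_signal` and `select_guide` helpers, and the numbers below are all hypothetical placeholders rather than the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A later checkpoint from the same training run (illustrative fields only)."""
    step: int   # training step at which the checkpoint was saved
    q: float    # assumed estimate of quality gain Q (higher is better)
    v: float    # assumed estimate of variance cost V (lower is better)


def learning_signal(q: float, v: float, eps: float = 1e-8) -> float:
    """Effective learning signal S = Q / V, guarded against V close to zero."""
    return q / max(v, eps)


def select_guide(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the largest S, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: learning_signal(c.q, c.v))


# Made-up estimates: a near checkpoint, a slightly later one, and a distant one.
candidates = [
    Candidate(step=120, q=0.25, v=0.125),  # S = 2.0: close, but little to learn
    Candidate(step=160, q=0.75, v=0.25),   # S = 3.0: stronger and still close
    Candidate(step=400, q=0.5, v=0.5),     # S = 1.0: too far from the current policy
]
best = select_guide(candidates)
print(best.step if best else None)
```

In this toy setting the middle checkpoint wins because its extra quality outweighs its extra variance cost, which mirrors the "strong enough and close enough" trade-off described above.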

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards

Off-policy Trajectories

Mixed-policy Methods

Effective Learning Signal

Policy Optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Near-Future Policy Optimization

Mixed-policy Reinforcement Learning

Self-guided Trajectory