A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization

πŸ“… 2026-01-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the instability and potential collapse in off-policy reinforcement learning fine-tuning of large language models (LLMs), which arises from token-level importance sampling ratios under significant policy shifts. The study is the first to identify that prefix-level importance ratios constitute the theoretically correct correction term. Building on this insight, the authors propose the Minimum Prefix Ratio Optimization (MinPRO) objective, a non-cumulative alternative that mitigates instability under large policy divergences. By integrating prefix-level policy correction with a minimum token-ratio proxy, MinPRO substantially enhances training stability and peak performance across both dense and mixture-of-experts LLMs. Its efficacy and robustness are validated on multiple mathematical reasoning benchmarks.
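In symbols (the notation below is assumed for illustration, not taken from the paper): for a response $y = (y_1, \dots, y_T)$ sampled from an older policy $\pi_\mu$ and scored under the target policy $\pi_\theta$, the three quantities discussed above can be written as

```latex
% Token-level importance ratio at position t, the common PPO-style correction:
r_t = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_\mu(y_t \mid x, y_{<t})}

% Theoretically correct prefix-level correction: a cumulative product whose
% variance grows with t, destabilizing training under large off-policy drift:
\rho_t = \prod_{k=1}^{t} r_k

% MinPRO's non-cumulative surrogate: the minimum token ratio over the prefix:
\tilde{\rho}_t = \min_{1 \le k \le t} r_k
```

The surrogate is bounded above by 1 whenever any prefix token has $r_k \le 1$, and unlike $\rho_t$ it cannot blow up or vanish through compounding.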

πŸ“ Abstract
Reinforcement learning (RL) post-training has increasingly demonstrated strong ability to elicit reasoning behaviors in large language models (LLMs). For training efficiency, rollouts are typically generated in an off-policy manner using an older sampling policy and then used to update the current target policy. To correct the resulting discrepancy between the sampling and target policies, most existing RL objectives rely on a token-level importance sampling ratio, primarily due to its computational simplicity and numerical stability. However, we observe that token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. In this paper, we revisit LLM policy optimization under off-policy conditions and show that the theoretically rigorous correction term is the prefix importance ratio, and that relaxing it to a token-level approximation can induce instability in RL post-training. To stabilize LLM optimization under large off-policy drift, we propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO). MinPRO replaces the unstable cumulative prefix ratio with a non-cumulative surrogate based on the minimum token-level ratio observed in the preceding prefix. Extensive experiments on both dense and mixture-of-experts LLMs, across multiple mathematical reasoning benchmarks, demonstrate that MinPRO substantially improves training stability and peak performance in off-policy regimes.
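A minimal sketch of the quantities the abstract contrasts, computed from per-token log-probabilities. Function names and the NumPy formulation are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def importance_ratios(logp_new, logp_old):
    """Per-token importance ratios r_t = pi_theta(y_t|prefix) / pi_mu(y_t|prefix),
    computed from log-probabilities for numerical safety."""
    return np.exp(np.asarray(logp_new) - np.asarray(logp_old))

def prefix_ratio(logp_new, logp_old):
    """Cumulative prefix importance ratio at each position: the product of all
    token ratios up to t. Theoretically correct, but its variance compounds
    with sequence length, which is the instability the paper identifies."""
    return np.exp(np.cumsum(np.asarray(logp_new) - np.asarray(logp_old)))

def minpro_ratio(logp_new, logp_old):
    """MinPRO-style non-cumulative surrogate (as described in the abstract):
    the minimum token-level ratio observed in the preceding prefix."""
    return np.minimum.accumulate(importance_ratios(logp_new, logp_old))
```

For example, if one mid-sequence token becomes much less likely under the target policy, the cumulative prefix ratio stays suppressed by that single factor forever after, while the min-based surrogate tracks the same worst-case token without multiplying in every other ratio.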
Problem

Research questions and friction points this paper is trying to address.

off-policy reinforcement learning
large language models
policy optimization
training instability
importance sampling

Innovation

Methods, ideas, or system contributions that make the work stand out.

prefix importance ratio
off-policy reinforcement learning
MinPRO
policy optimization
large language models