FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

📅 2026-03-20

📈 Citations: 0

✨ Influential: 0

career value

171K/year

🤖 AI Summary

Large language models often suffer from limited reasoning efficiency in reinforcement learning due to coarse-grained rewards that fail to distinguish critical reasoning steps from ordinary tokens. This work proposes Future-KL Influenced Policy Optimization (FIPO), which, for the first time, incorporates discounted future KL divergence into policy optimization to construct a dense advantage function, enabling fine-grained credit assignment based on each token’s influence on future trajectories. Evaluated on Qwen2.5-32B, FIPO increases the average chain-of-thought length from approximately 4,000 to over 10,000 tokens and improves AIME 2024 Pass@1 accuracy from 50.0% to 58.0%, surpassing both DeepSeek-R1-Zero-Math-32B and o1-mini and exceeding the performance ceiling of existing ORM-based methods.

Technology Category

Application Category

📝 Abstract

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0\%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning

credit assignment

reasoning bottlenecks

large language models

outcome-based rewards

Innovation

Methods, ideas, or system contributions that make the work stand out.

Future-KL

dense advantage

policy optimization