Future Policy Aware Preference Learning for Mathematical Reasoning

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
In mathematical reasoning, preference learning methods (e.g., DPO) suffer performance degradation from excessive penalization of shared, semantically useful tokens between preferred and dispreferred trajectories, which overlap heavily at the token level. To address this, the authors propose Future Policy Aware (FPA) preference learning: a lightweight extrapolation in logit space from the reference model toward the current model estimates a "future policy," which replaces the current policy in the regularization term of standard frameworks (e.g., DPO, RPO, SimPER). This foresighted regularization proactively adjusts gradients, mitigating over-suppression of beneficial tokens. FPA requires no additional trajectory sampling and incurs negligible computational overhead. Evaluated on MATH and GSM8K, it consistently improves over strong baselines, achieving up to a 5.75% absolute gain, and enables stable, extended training.
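The core mechanism is a logit-space extrapolation from the reference model toward the current model. A minimal PyTorch sketch of that idea follows; the extrapolation form `z_ref + alpha * (z_cur - z_ref)` and the coefficient `alpha` are illustrative assumptions, not the paper's published formula or hyperparameters.

```python
import torch
import torch.nn.functional as F

def future_policy_log_probs(logits_cur, logits_ref, alpha=2.0):
    """Estimate 'future policy' log-probs by extrapolating in logit space
    from the reference model toward the current model.

    alpha = 1 recovers the current policy; alpha > 1 extrapolates past it,
    anticipating where training is heading. The exact form and alpha value
    here are assumptions for illustration.
    """
    logits_future = logits_ref + alpha * (logits_cur - logits_ref)
    return F.log_softmax(logits_future, dim=-1)
```

Because the future logits are an affine function of the current logits, gradients still flow to the current model, and no extra forward passes or trajectory sampling are needed beyond what DPO-style training already computes.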

📝 Abstract
Preference learning methods such as Direct Preference Optimization (DPO) have become standard for Large Language Model (LLM) post-training, yet they are often ineffective for mathematical reasoning. A key challenge is the large token overlap between preferred and dispreferred trajectories; lowering the probability of dispreferred trajectories also reduces the probability of shared useful tokens, leading to over-penalization and overall performance collapse. As a mitigation, existing algorithms include the probability of a trajectory under the current policy as a regularization term, which decreases the effect of the gradient when the probability is low. However, by the time this effect takes hold, useful tokens may have already been over-penalized as the model has begun to degrade. To address this, we propose Future Policy Aware (FPA) preference learning, which replaces the current policy with a future policy in the regularization term. This future policy is estimated via lightweight, logit-space extrapolation from a reference model toward the current model. FPA enables safer training by preemptively regularizing potentially problematic gradients. We apply FPA to DPO, RPO, and SimPER and evaluate them on the MATH and GSM8K benchmarks. FPA yields consistent performance gains, with the largest improvements observed with SimPER, achieving gains of up to 5.75%. We demonstrate that FPA provides proactive regularization while preserving the probability of shared, useful mathematical tokens, and enables longer, degradation-free training with negligible computational overhead. We will release our code publicly upon publication.
Problem

Research questions and friction points this paper is trying to address.

Existing preference learning methods over-penalize shared tokens in mathematical reasoning
Current policy regularization fails to prevent early degradation during training
Mathematical reasoning suffers from performance collapse due to token overlap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces current policy with future policy regularization
Uses logit-space extrapolation from reference model
Proactively regularizes gradients to prevent token over-penalization
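The innovations above can be sketched as a DPO-style objective in which the trajectory log-probabilities used for regularization come from the extrapolated future policy rather than the current policy. This is a hypothetical reconstruction from the summary, not the paper's exact loss; the `beta` value and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def fpa_dpo_loss(logp_fut_w, logp_fut_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO-style preference loss with future-policy log-probs (a sketch).

    logp_fut_*: summed trajectory log-probs under the extrapolated future
    policy for the preferred (w) and dispreferred (l) trajectories.
    logp_ref_*: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_fut_w - logp_ref_w) - (logp_fut_l - logp_ref_l))
    # Saturating sigmoid downweights gradients earlier, before shared
    # useful tokens have been over-penalized.
    return -F.logsigmoid(margin).mean()
```

Since the future policy's probabilities drop below the current policy's along the degradation direction, the sigmoid saturates sooner, which is what the summary describes as preemptive regularization of problematic gradients.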
Minjae Oh
Graduate School of Data Science, Seoul National University
Yunho Choi
Gwangju Institute of Science and Technology
Dongmin Choi
Graduate School of Data Science, Seoul National University
Yohan Jo
Seoul National University
Natural Language Processing · Agents · Computational Psychology · Reasoning