🤖 AI Summary
This work addresses the inefficiency and instability of conventional preference alignment methods, such as Direct Preference Optimization (DPO), which treat all training samples uniformly and ignore how the utility of each sample changes during optimization. To overcome this limitation, the authors propose SAGE, a framework that integrates dynamic curriculum learning with a stability-aware scoring function to prioritize informative, high-confidence misaligned samples while filtering out noisy ones. By combining coarse-grained curriculum scheduling with fine-grained signal-to-noise ratio optimization, SAGE moves beyond static weighting strategies, substantially improving both alignment efficiency and training stability. Experimental results show that SAGE accelerates convergence and outperforms existing static baselines across multiple mathematical reasoning benchmarks.
📝 Abstract
Preference-based alignment is pivotal for training large reasoning models; however, standard methods like Direct Preference Optimization (DPO) typically treat all preference pairs uniformly, overlooking the evolving utility of training instances. This static approach often leads to inefficient or unstable optimization: it wastes computation on trivial pairs with negligible gradients and suffers from noise induced by samples near uncertain decision boundaries. To address these challenges, we propose SAGE (Stability-Aware Gradient Efficiency), a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. Concretely, SAGE integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence with a fine-grained, stability-aware scoring function that prioritizes informative, confident errors while filtering out unstable samples. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines, highlighting the critical role of policy-aware, stability-conscious data selection in reasoning alignment.
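The abstract's selection idea can be illustrated with a minimal sketch. This is not the paper's exact formulation: the margin-history input, the use of the DPO gradient weight `sigmoid(-beta * margin)` as the "signal", the margin's standard deviation as the "noise", and the `min_score` threshold are all assumptions made here for illustration. Under those assumptions, confident errors (consistently negative margins) score high, trivial already-solved pairs score low, and unstable boundary pairs are filtered out.

```python
import math

def stability_score(margins, beta=1.0, eps=1e-8):
    """Hypothetical stability-aware score for one preference pair.

    `margins` is an assumed history of the pair's DPO implicit-reward
    margins (chosen minus rejected) over recent policy checkpoints.
    Signal: the DPO gradient weight sigmoid(-beta * mean_margin), large
    for confident errors (negative margin) and near zero for trivial,
    already-solved pairs. Noise: the margin's standard deviation, large
    for samples flickering around the decision boundary.
    """
    mean = sum(margins) / len(margins)
    var = sum((m - mean) ** 2 for m in margins) / len(margins)
    signal = 1.0 / (1.0 + math.exp(beta * mean))  # sigmoid(-beta * mean)
    return signal / (math.sqrt(var) + eps)

def select_pairs(pool, k, min_score=0.5):
    """Score every candidate pair, drop unstable ones below `min_score`,
    and keep the k highest-scoring pairs for the next training round.
    `pool` maps pair ids to margin histories; thresholds are illustrative.
    """
    scored = sorted(((stability_score(h), pid) for pid, h in pool.items()),
                    reverse=True)
    return [pid for s, pid in scored if s >= min_score][:k]
```

For example, a pair with margins `[-2.0, -1.9, -2.1]` (a stable, confident error) outranks one with `[2.0, 2.1, 1.9]` (trivially solved), while `[-2.0, 2.0, -2.0, 2.0]` (oscillating near the boundary) falls below the threshold and is filtered. A coarse-grained curriculum step would periodically rebuild `pool` as the policy improves, as the abstract describes.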