🤖 AI Summary
This work addresses the tension between reasoning plasticity and general capability retention in GRPO post-training by proposing a Bayesian probabilistic conflict resolution framework. It models gradients as random variables and employs an uncertainty-aware “soft projection” mechanism to dynamically reconcile geometric conflicts between plasticity- and stability-oriented gradients. This formulation addresses a limitation of conventional deterministic projection methods, which neglect the inherent stochasticity of group-based gradient estimates. By combining Bayesian inference, probabilistic gradient modeling, and signal-to-noise ratio optimization, the method substantially smooths the training trajectory. Empirical results show improvements over existing baselines across a range of reasoning tasks, balancing model adaptability with stability.
📝 Abstract
Training stability remains a critical bottleneck for Group Relative Policy Optimization (GRPO), often manifesting as a trade-off between reasoning plasticity and general capability retention. We identify a root cause as the geometric conflict between plasticity and stability gradients, which leads to destructive interference. Crucially, we argue that deterministic projection methods are suboptimal for GRPO as they overlook the intrinsic stochasticity of group-based gradient estimates. To address this, we propose Probabilistic Conflict Resolution (PCR), a Bayesian framework that models gradients as random variables. PCR dynamically arbitrates conflicts via an uncertainty-aware "soft projection" mechanism, optimizing the signal-to-noise ratio. Extensive experiments demonstrate that PCR significantly smooths the training trajectory and achieves superior performance in various reasoning tasks.
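To make the contrast between deterministic and uncertainty-aware projection concrete, here is a minimal illustrative sketch. It is not the PCR update rule, which the abstract does not specify. It assumes that several noisy estimates of the plasticity gradient are available (for example, one per group), that their dot product with the stability gradient is approximately Gaussian, and that the projection strength can be scaled by the estimated probability that the two gradients truly conflict. The names `hard_project` and `soft_project` are hypothetical.

```python
import math
import numpy as np

def hard_project(g_p, g_s):
    """Deterministic projection: if g_p opposes g_s (negative dot product),
    remove the component of g_p along g_s; otherwise leave g_p unchanged."""
    dot = float(g_p @ g_s)
    if dot >= 0.0:
        return g_p.copy()
    return g_p - dot / float(g_s @ g_s) * g_s

def soft_project(samples_p, g_s):
    """Illustrative soft projection (an assumption, not the paper's rule).

    samples_p: array of shape (n, d), noisy estimates of the plasticity gradient.
    g_s:       array of shape (d,), the stability gradient.
    Returns the adjusted mean gradient and the estimated conflict probability.
    """
    mean_p = samples_p.mean(axis=0)
    dots = samples_p @ g_s                       # per-sample alignment with g_s
    mu = float(dots.mean())
    se = float(dots.std(ddof=1)) / math.sqrt(len(dots))
    z = mu / (se + 1e-12)                        # signal-to-noise ratio of the alignment
    p_conflict = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(true dot < 0), Gaussian approx.
    # Remove the conflicting component in proportion to our confidence in the conflict.
    coef = p_conflict * min(mu, 0.0) / float(g_s @ g_s)
    return mean_p - coef * g_s, p_conflict

g_s = np.array([1.0, 0.0])
confident = np.array([[-1.0, 1.0], [-1.01, 1.0], [-0.99, 1.0]])
noisy = np.array([[-3.0, 1.0], [2.0, 1.0], [-2.0, 1.0], [1.0, 1.0]])

g_conf, p_conf = soft_project(confident, g_s)
g_noisy, p_noisy = soft_project(noisy, g_s)
```

Under these assumptions, a conflict measured with low noise yields a conflict probability near 1 and the result matches the deterministic projection, while a conflict that is mostly noise is only partially projected out.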