🤖 AI Summary
This work targets three failure modes introduced by the clipping mechanisms used in reinforcement-learning-based post-training of large language models: zero-gradient regions, reward hacking, and training instability. The authors propose a clipping-free policy optimization method that replaces conventional heuristic clipping with a convex quadratic penalty derived from a total variation divergence constraint, yielding an everywhere-differentiable objective and stable policy updates. Requiring only a one-line code substitution and no additional hyperparameters, the approach matches clipping-based methods on downstream reasoning benchmarks while expanding the stable training regime, and in alignment tasks it mitigates verbose outputs and capability degradation while achieving competitive instruction-following performance.
📝 Abstract
Reinforcement learning has become central to post-training large language models, yet dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale, including zero-gradient regions, reward hacking, and training instability. We propose Clipping-Free Policy Optimization (CFPO), which replaces heuristic clipping with a convex quadratic penalty derived from Total Variation divergence constraints, yielding an everywhere-differentiable objective that enforces stable policy updates without hard boundaries. We evaluate CFPO across both reasoning and alignment settings. In reasoning, CFPO matches clipping-based methods on downstream benchmarks while extending the stable training regime. In alignment, CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance. CFPO requires only a one-line code change and no additional hyperparameters. Our results suggest that CFPO is a promising drop-in alternative to clipping-based methods for LLM post-training.
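The abstract describes CFPO as a one-line substitution: the hard clip in the PPO-style surrogate is replaced by a convex quadratic penalty on the policy ratio's deviation from 1 (a proxy for the total variation divergence between the new and old policies). The paper's exact penalty coefficient is not given here, so the sketch below uses an illustrative coefficient chosen only to show the qualitative contrast: the clipped surrogate has flat (zero-gradient) regions, while the quadratic-penalty surrogate is differentiable everywhere. Both function names and the penalty form are our assumptions, not the authors' code.

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    # Standard PPO clipped surrogate. Wherever the clipped branch is
    # active (e.g. ratio > 1 + eps with positive advantage), the
    # gradient w.r.t. the ratio is exactly zero.
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

def cfpo_objective(ratio, adv, eps=0.2):
    # Hypothetical CFPO-style surrogate (illustrative form, not the
    # paper's exact formula): the hard clip is replaced by a convex
    # quadratic penalty on (ratio - 1), the TV-divergence proxy.
    # The resulting objective is smooth everywhere, so gradients never
    # vanish on a whole region, and large ratio deviations are pulled
    # back by the penalty rather than silently ignored.
    return ratio * adv - (np.abs(adv) / (2 * eps)) * (ratio - 1.0) ** 2
```

At `ratio = 1` the two surrogates coincide (both equal the advantage), but for a ratio outside the clip range the clipped objective is flat while the quadratic-penalty objective still provides a restoring gradient, which is the "no hard boundaries" property the abstract highlights.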