Clipping-Free Policy Optimization for Large Language Models

📅 2026-01-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the zero gradients, reward hacking, and training instability that clipping mechanisms introduce in reinforcement learning–based post-training of large language models. The authors propose a clipping-free policy optimization method that replaces conventional heuristic clipping with a convex quadratic penalty term derived from a total variation divergence constraint, yielding everywhere-differentiable and stable policy updates. Requiring only a one-line code substitution and no additional hyperparameters, the approach substantially improves training stability and alignment performance: it preserves downstream task performance while expanding the stable training regime in reasoning tasks, and it mitigates verbose outputs and capability degradation in alignment tasks, all while achieving competitive instruction-following performance.

📝 Abstract
Reinforcement learning has become central to post-training large language models, yet dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale, including zero-gradient regions, reward hacking, and training instability. We propose Clipping-Free Policy Optimization (CFPO), which replaces heuristic clipping with a convex quadratic penalty derived from Total Variation divergence constraints, yielding an everywhere-differentiable objective that enforces stable policy updates without hard boundaries. We evaluate CFPO across both reasoning and alignment settings. In reasoning, CFPO matches clipping-based methods on downstream benchmarks while extending the stable training regime. In alignment, CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance. CFPO requires only a one-line code change and no additional hyperparameters. Our results suggest that CFPO is a promising drop-in alternative to clipping-based methods for LLM post-training.
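The abstract describes swapping the hard clip in the surrogate objective for a convex quadratic penalty on the probability ratio. The paper's exact penalty form is not reproduced here; the sketch below is a minimal illustration of the general idea, assuming a generic quadratic penalty on the ratio's deviation from 1 scaled by the same `eps` that the clipped baseline uses (so no new hyperparameter is introduced). The function names and the reuse of `eps` as the penalty scale are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def ppo_clip_loss(ratio, adv, eps=0.2):
    """Standard clipped surrogate loss (PPO-style baseline).
    Note: the gradient w.r.t. ratio vanishes in the clipped region,
    which is one of the issues the abstract attributes to clipping."""
    return -np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)

def quadratic_penalty_loss(ratio, adv, eps=0.2):
    """Illustrative clipping-free objective (NOT the paper's exact form):
    the hard clip is replaced by a convex quadratic penalty that grows
    smoothly as the ratio drifts from 1, so the objective stays
    differentiable everywhere."""
    penalty = (ratio - 1.0) ** 2 / (2.0 * eps)  # convex trust-region term
    return -(ratio * adv - penalty)
```

Both objectives agree at `ratio = 1` (where the penalty vanishes), but the quadratic variant keeps a nonzero gradient for large ratios instead of flat-lining past the clip boundary.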
Problem

Research questions and friction points this paper is trying to address.

clipping
reinforcement learning
large language models
training instability
reward hacking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clipping-Free Policy Optimization
Total Variation divergence
reinforcement learning
large language models
stable policy updates
Ömer Veysel Çağatan
KUIS AI Center, Koç University, Istanbul, Türkiye
Barış Akgün
KUIS AI Center, Koç University, Istanbul, Türkiye
Gözde Gül Şahin
Assistant Professor, Koç University
Natural Language Processing · Machine Learning · Semantics · Low-resource NLP
Xuandong Zhao
UC Berkeley
Machine Learning · Natural Language Processing · AI Safety