🤖 AI Summary
This work targets three failure modes introduced by the clipping mechanisms used in reinforcement-learning-based post-training of large language models: zero-gradient regions, reward hacking, and training instability. The authors propose a clipping-free policy optimization method that replaces conventional heuristic clipping with a convex quadratic penalty derived from a total variation divergence constraint, yielding an everywhere-differentiable objective and stable policy updates. Requiring only a one-line code substitution and no additional hyperparameters, the approach matches clipping-based methods on downstream reasoning benchmarks while expanding the stable training regime, and in alignment tasks it mitigates verbose outputs and capability degradation while achieving competitive instruction-following performance.
📝 Abstract
Reinforcement learning has become central to post-training large language models, yet dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale, including zero-gradient regions, reward hacking, and training instability. We propose Clipping-Free Policy Optimization (CFPO), which replaces heuristic clipping with a convex quadratic penalty derived from Total Variation divergence constraints, yielding an everywhere-differentiable objective that enforces stable policy updates without hard boundaries. We evaluate CFPO across both reasoning and alignment settings. In reasoning, CFPO matches clipping-based methods on downstream benchmarks while extending the stable training regime. In alignment, CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance. CFPO requires only a one-line code change and no additional hyperparameters. Our results suggest that CFPO is a promising drop-in alternative to clipping-based methods for LLM post-training.
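The abstract describes CFPO as a one-line substitution: the hard clip in the PPO-style surrogate is replaced by a convex quadratic penalty on the policy ratio's deviation from 1 (a proxy for the total variation divergence between the new and old policies). The paper's exact penalty coefficient is not given here, so the sketch below uses an illustrative coefficient chosen only to show the qualitative contrast: the clipped surrogate has flat (zero-gradient) regions, while the quadratic-penalty surrogate is differentiable everywhere. Both function names and the penalty form are our assumptions, not the authors' code.

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    # Standard PPO clipped surrogate. Wherever the clipped branch is
    # active (e.g. ratio > 1 + eps with positive advantage), the
    # gradient w.r.t. the ratio is exactly zero.
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

def cfpo_objective(ratio, adv, eps=0.2):
    # Hypothetical CFPO-style surrogate (illustrative form, not the
    # paper's exact formula): the hard clip is replaced by a convex
    # quadratic penalty on (ratio - 1), the TV-divergence proxy.
    # The resulting objective is smooth everywhere, so gradients never
    # vanish on a whole region, and large ratio deviations are pulled
    # back by the penalty rather than silently ignored.
    return ratio * adv - (np.abs(adv) / (2 * eps)) * (ratio - 1.0) ** 2
```

At `ratio = 1` the two surrogates coincide (both equal the advantage), but for a ratio outside the clip range the clipped objective is flat while the quadratic-penalty objective still provides a restoring gradient, which is the "no hard boundaries" property the abstract highlights.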