Stabilizing Policy Optimization via Logits Convexity

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the instability of reinforcement learning (RL) compared to supervised fine-tuning (SFT) in large language model training. From a gradient perspective, it identifies—for the first time—that the convexity of the logits space plays a pivotal role in the stability of policy optimization. Building on this insight, the authors propose the Logits Convex Optimization (LCO) framework, which aligns and optimizes objectives directly in the logits space by integrating convexity-aware alignment, gradient direction analysis, and enhanced policy gradient estimation. This approach substantially improves training stability without sacrificing performance. Extensive experiments across multiple model families and benchmarks demonstrate that LCO consistently outperforms conventional RL methods, achieving superior stability and competitive or better task performance.

📝 Abstract
While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.
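The convexity property the abstract attributes to SFT can be checked directly: the token-level softmax cross-entropy loss L(z) = logsumexp(z) − z_y is convex in the logits z, with Hessian diag(p) − p pᵀ (p the softmax probabilities), which is positive semidefinite. A minimal numerical sketch of this standard fact (not code from the paper; function names are illustrative):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a logits vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_hessian(z):
    # Hessian of L(z) = logsumexp(z) - z_y w.r.t. logits z:
    # H = diag(p) - p p^T, independent of the target index y.
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(0)
z = rng.normal(size=5)
H = ce_hessian(z)
eigvals = np.linalg.eigvalsh(H)
print(eigvals.min() >= -1e-12)  # True: the Hessian is PSD, so the loss is convex in z
```

By contrast, PPO's clipped surrogate is optimized through probability ratios rather than logits and carries no such convexity guarantee, which is the gap the LCO framework targets.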
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
training stability
policy optimization
logits convexity
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Logits Convexity
Policy Optimization
Reinforcement Learning
Training Stability
Proximal Policy Optimization
🔎 Similar Papers
2024-07-09 · Neural Information Processing Systems · Citations: 3
Hongzhan Chen
School of Computer Science and Engineering, Sun Yat-sen University, China and Shanghai Innovation Institute
Tao Yang
Wechat Search, Tencent Inc, China
Yuhua Zhu
Postdoctoral Fellow, Stanford University
applied and computational mathematics · kinetic equations · reinforcement learning
Shiping Gao
School of Computer Science and Engineering, Sun Yat-sen University, China
Xiaojun Quan
Professor, School of Computer Science and Engineering, Sun Yat-sen University
natural language processing · text mining · machine learning
Ting Yao
Wechat Search, Tencent Inc, China