BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

πŸ“… 2026-03-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a key limitation of standard Proximal Policy Optimization (PPO)β€”its use of a fixed clipping threshold, which often overly suppresses low-probability yet high-advantage actions, leading to insufficient exploration and rapid policy entropy collapse. To overcome this, the authors propose BandPO, a novel approach that constructs a dynamic, probability-aware trust region via f-divergence and introduces a unified Band operator to map it into an adaptive clipping boundary. This design enhances exploration while preserving update stability, theoretically guarantees convergence to the global optimum, and admits closed-form solutions under specific divergence choices. Empirical results demonstrate that BandPO consistently outperforms both standard PPO and Clip-Higher across diverse models and tasks, effectively mitigating entropy collapse and improving policy performance.

πŸ“ Abstract
Proximal constraints are fundamental to the stability of Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate the mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution, and derive closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher while robustly mitigating entropy collapse.
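To make the described bottleneck concrete, the sketch below contrasts the canonical PPO clipped surrogate with a toy probability-aware variant that loosens the upper clipping bound for low-probability actions. This is an illustration of the general idea only, under assumptions of our own: the `prob_aware_clip_objective` function and its widening rule `1 + eps + k * (1 - old_prob)` are hypothetical and are not the paper's Band operator, which is derived from f-divergence trust regions.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # Canonical PPO surrogate: the importance ratio is clipped to the
    # fixed interval [1 - eps, 1 + eps] regardless of the action's
    # probability under the old policy.
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

def prob_aware_clip_objective(ratio, advantage, old_prob, eps=0.2, k=1.0):
    # Hypothetical probability-aware band (NOT the paper's Band operator):
    # the upper bound widens as old_prob shrinks, so a low-probability,
    # high-advantage action can receive a larger upward update than the
    # fixed clip would allow.
    upper = 1 + eps + k * (1 - old_prob)
    lower = 1 - eps
    return np.minimum(ratio * advantage,
                      np.clip(ratio, lower, upper) * advantage)

# A rare token (old_prob = 0.01) with positive advantage and ratio 2.0:
# the fixed clip caps the surrogate at 1.2 * A, while the probability-aware
# bound admits the full 2.0 * A update.
print(ppo_clip_objective(2.0, 1.0))              # 1.2
print(prob_aware_clip_objective(2.0, 1.0, 0.01)) # 2.0
```

The toy variant recovers the canonical clip for high-probability actions (as `old_prob` approaches 1, the extra margin vanishes), mirroring the paper's goal of relaxing the constraint only where fixed bounds over-suppress exploration.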
Problem

Research questions and friction points this paper is trying to address.

trust regions
ratio clipping
entropy collapse
reinforcement learning
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

BandPO
probability-aware clipping
trust region
entropy collapse
f-divergence
πŸ”Ž Similar Papers
No similar papers found.
Authors

Yuan Li, Fudan University
Bo Wang, Fudan University
Yufei Gao, Zhengzhou University (Machine Learning, Medical Image Analysis)
Yuqian Yao, Fudan University
Xinyuan Wang, Fudan University (AI Safety, Computer Vision, Manifold Learning, Reinforcement Learning, Multi-objective Optimization)
Zhangyue Yin, Fudan University
Xipeng Qiu, Fudan University