BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

πŸ“… 2026-03-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a key limitation of standard Proximal Policy Optimization (PPO)β€”its use of a fixed clipping threshold, which often overly suppresses low-probability yet high-advantage actions, leading to insufficient exploration and rapid policy entropy collapse. To overcome this, the authors propose BandPO, a novel approach that constructs a dynamic, probability-aware trust region via f-divergence and introduces a unified Band operator to map it into an adaptive clipping boundary. This design enhances exploration while preserving update stability, theoretically guarantees convergence to the global optimum, and admits closed-form solutions under specific divergence choices. Empirical results demonstrate that BandPO consistently outperforms both standard PPO and Clip-Higher across diverse models and tasks, effectively mitigating entropy collapse and improving policy performance.

πŸ“ Abstract
Proximal constraints are fundamental to the stability of Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate the mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution, and derive closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher while robustly mitigating entropy collapse.
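To make the described bottleneck concrete, the sketch below contrasts the canonical PPO clipped surrogate with a toy probability-aware variant that loosens the upper clipping bound for low-probability actions. This is an illustration of the general idea only, under assumptions of our own: the `prob_aware_clip_objective` function and its widening rule `1 + eps + k * (1 - old_prob)` are hypothetical and are not the paper's Band operator, which is derived from f-divergence trust regions.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # Canonical PPO surrogate: the importance ratio is clipped to the
    # fixed interval [1 - eps, 1 + eps] regardless of the action's
    # probability under the old policy.
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

def prob_aware_clip_objective(ratio, advantage, old_prob, eps=0.2, k=1.0):
    # Hypothetical probability-aware band (NOT the paper's Band operator):
    # the upper bound widens as old_prob shrinks, so a low-probability,
    # high-advantage action can receive a larger upward update than the
    # fixed clip would allow.
    upper = 1 + eps + k * (1 - old_prob)
    lower = 1 - eps
    return np.minimum(ratio * advantage,
                      np.clip(ratio, lower, upper) * advantage)

# A rare token (old_prob = 0.01) with positive advantage and ratio 2.0:
# the fixed clip caps the surrogate at 1.2 * A, while the probability-aware
# bound admits the full 2.0 * A update.
print(ppo_clip_objective(2.0, 1.0))              # 1.2
print(prob_aware_clip_objective(2.0, 1.0, 0.01)) # 2.0
```

The toy variant recovers the canonical clip for high-probability actions (as `old_prob` approaches 1, the extra margin vanishes), mirroring the paper's goal of relaxing the constraint only where fixed bounds over-suppress exploration.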
Problem

Research questions and friction points this paper is trying to address.

trust regions
ratio clipping
entropy collapse
reinforcement learning
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

BandPO
probability-aware clipping
trust region
entropy collapse
f-divergence
πŸ”Ž Similar Papers
No similar papers found.
Authors

Yuan Li, Fudan University
Bo Wang, Fudan University
Yufei Gao, Zhengzhou University (Machine Learning, Medical Image Analysis)
Yuqian Yao, Fudan University
Xinyuan Wang, Fudan University (AI Safety, Computer Vision, Manifold Learning, Reinforcement Learning, Multi-objective Optimization)
Zhangyue Yin, Fudan University
Xipeng Qiu, Fudan University