🤖 AI Summary
In reinforcement learning for large language models, policy gradient methods suffer from approximation error that scales quadratically with sequence length $T$ ($O(T^2)$) under policy mismatch, so classical trust-region guarantees break down on long-horizon tasks. Method: We derive the first tight trust-region error bounds of $O(T)$ and $O(T^{3/2})$, and propose Trust Region Masking (TRM), a trajectory-wise masking mechanism that discards an entire sampled trajectory whenever its maximum token-level KL divergence violates the trust region. Contribution/Results: TRM establishes the first sequence-level policy optimization framework with non-vacuous monotonic improvement guarantees. Experiments demonstrate significantly improved policy stability and convergence on long-horizon tasks, overcoming the trust-region breakdown inherent in token-independent methods such as PPO, while providing verifiable theoretical guarantees.
📝 Abstract
Policy gradient methods for large language models optimize a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. When $\pi_{\text{roll}} \neq \pi_\theta$, the surrogate deviates from the true objective. Prior work has shown that this off-policy mismatch is unavoidable in modern LLM-RL due to implementation divergence, mixture-of-experts routing discontinuities, and distributed training staleness. Classical trust-region bounds on the resulting error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. We derive two tighter bounds: a Pinsker-Marginal bound scaling as $O(T^{3/2})$ and a Mixed bound scaling as $O(T)$. Crucially, both bounds depend on $D_{\mathrm{KL}}^{\mathrm{tok,max}}$ -- the maximum token-level KL divergence across all positions in a sequence. This is inherently a sequence-level quantity: computing it requires examining the entire trajectory, so it cannot be controlled by token-independent methods such as PPO clipping. We propose Trust Region Masking (TRM), which excludes an entire sequence from the gradient computation if any of its tokens violates the trust region, providing the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.
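To make the masking rule concrete, below is a minimal NumPy sketch of the TRM idea as described: a sequence contributes to the policy-gradient surrogate only if every one of its per-token KL estimates stays within a trust-region threshold. The function names (`trm_mask`, `masked_pg_loss`) and the threshold parameter `delta` are illustrative, not from the paper; the per-token KL values are assumed to be precomputed between $\pi_{\text{roll}}$ and $\pi_\theta$.

```python
import numpy as np

def trm_mask(tok_kl: np.ndarray, delta: float) -> np.ndarray:
    """Trust Region Masking (sketch): keep a sequence only if all of its
    token-level KL divergences lie within the trust region delta.

    tok_kl: (batch, T) per-token KL(pi_roll || pi_theta) estimates
    returns: (batch,) boolean keep-mask over whole sequences
    """
    # D_KL^{tok,max} is a sequence-level quantity: the max over positions.
    seq_max_kl = tok_kl.max(axis=1)
    return seq_max_kl <= delta

def masked_pg_loss(logp_theta: np.ndarray, advantages: np.ndarray,
                   tok_kl: np.ndarray, delta: float) -> float:
    """Policy-gradient surrogate with whole-sequence masking: a single
    violating token discards the entire trajectory, unlike PPO clipping,
    which acts on tokens independently."""
    keep = trm_mask(tok_kl, delta)                     # (batch,)
    per_seq = -(logp_theta * advantages).sum(axis=1)   # (batch,)
    n_kept = keep.sum()
    if n_kept == 0:
        return 0.0  # every trajectory violated the trust region
    return float((per_seq * keep).sum() / n_kept)
```

Note the contrast with PPO: clipping modifies each token's ratio locally, whereas here the maximum over positions, a quantity visible only at the sequence level, decides whether the trajectory is used at all.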