Trust Region Masking for Long-Horizon LLM Reinforcement Learning

📅 2025-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement learning for large language models, policy gradient methods suffer from approximation error that scales quadratically with sequence length $T$ ($O(T^2)$) under off-policy mismatch, so conventional trust-region guarantees become vacuous on long-horizon tasks. Method: The authors derive tighter trust-region error bounds of $O(T^{3/2})$ and $O(T)$ and propose Trust Region Masking (TRM), a trajectory-wise masking mechanism that discards an entire sampled trajectory whenever the maximum token-level KL divergence within it exceeds the trust region. Contribution/Results: TRM establishes the first sequence-level policy optimization framework with non-vacuous monotonic improvement guarantees. Experiments demonstrate significantly enhanced policy stability and convergence on long-horizon tasks, overcoming the trust-region breakdown inherent in token-independent methods such as PPO.

📝 Abstract
Policy gradient methods for large language models optimize a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. When $\pi_{\text{roll}} \neq \pi_\theta$, there is approximation error between the surrogate and the true objective. Prior work has shown that this off-policy mismatch is unavoidable in modern LLM-RL due to implementation divergence, mixture-of-experts routing discontinuities, and distributed training staleness. Classical trust region bounds on the resulting error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. We derive two tighter bounds: a Pinsker-Marginal bound scaling as $O(T^{3/2})$ and a Mixed bound scaling as $O(T)$. Crucially, both bounds depend on $D_{\mathrm{KL}}^{\mathrm{tok,max}}$ -- the maximum token-level KL divergence across all positions in a sequence. This is inherently a sequence-level quantity: it requires examining the entire trajectory to compute, and therefore cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which excludes entire sequences from gradient computation if any token violates the trust region, providing the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.
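The masking rule in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the trust-region radius `kl_max`, and the choice of the single-sample "k3" KL estimator (`exp(r) - r - 1`) are all assumptions; the paper only specifies that a trajectory is dropped when its maximum token-level KL exceeds the trust region.

```python
import math

def trm_keep_mask(logp_theta, logp_roll, kl_max=0.05):
    """Sketch of Trust Region Masking (TRM): discard an entire sampled
    trajectory if ANY token's estimated KL divergence exceeds the
    trust-region radius kl_max.

    logp_theta, logp_roll: per-token log-probabilities of the sampled
    tokens under the current policy pi_theta and the rollout policy
    pi_roll, one list per trajectory (names are illustrative).
    """
    keep = []
    for lp_t, lp_r in zip(logp_theta, logp_roll):
        # Per-token single-sample KL estimate ("k3"): exp(r) - r - 1
        # with r = log pi_theta - log pi_roll; nonnegative by construction.
        tok_kl = [math.exp(t - r) - (t - r) - 1.0 for t, r in zip(lp_t, lp_r)]
        # The sequence-level quantity from the abstract: the max token-level
        # KL over the whole trajectory. One bad token masks the sequence.
        keep.append(max(tok_kl) <= kl_max)
    return keep
```

Because the mask is all-or-nothing per trajectory, the surviving gradient batch provably stays inside the trust region at every position, which is exactly the condition the paper's bounds require.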
Problem

Research questions and friction points this paper is trying to address.

Addresses off-policy mismatch in LLM reinforcement learning
Tightens trust region bounds for long-horizon sequence tasks
Proposes sequence-level masking to ensure monotonic improvement guarantees
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trust Region Masking excludes whole sequences in which any token violates the KL trust region
Tighter bounds scale as O(T) and O(T^(3/2)) with sequence length
Sequence-level control avoids approximation error from off-policy mismatch
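The practical gap between these scalings is easy to see numerically. The sketch below is purely illustrative: the per-token KL radius `eps` and the unit constant factors are hypothetical; only the growth-rate exponents ($T^2$ vs. $T^{3/2}$ vs. $T$) come from the paper.

```python
# Compare the claimed error-bound growth rates for a fixed per-token KL
# radius eps. Constant factors are made up; only the exponents are from
# the paper's classical O(T^2), Pinsker-Marginal O(T^{3/2}), and Mixed
# O(T) bounds.

def bound_scalings(T, eps=0.01):
    return {
        "classical_T2": eps * T**2,    # vacuous once it dwarfs any bounded return
        "pinsker_T32": eps * T**1.5,   # Pinsker-Marginal bound
        "mixed_T1": eps * T,           # Mixed bound
    }

for T in (100, 1000, 10000):
    print(T, bound_scalings(T))
```

At a horizon of $T = 10{,}000$ tokens the quadratic bound is three orders of magnitude larger than the linear one, which is why the classical guarantee carries no information for long-horizon tasks while the $O(T)$ bound can remain meaningful.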