TROLL: Trust Regions improve Reinforcement Learning for Large Language Models

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing PPO-style clipping objectives in reward-based fine-tuning of LLMs suffer from training instability and suboptimal performance because clipping is only a coarse proxy for a KL-based trust region. To address this, the paper proposes TROLL (Trust Region Optimization for Large Language Models), which replaces the clip objective with a discrete differentiable trust-region projection that enforces principled token-level KL constraints. The projection identifies a sparse subset of the model's most important token logits and applies the KL-constrained projection directly on them, balancing computational cost against projection effectiveness. The method is architecture-agnostic, compatible with diverse advantage estimators and model families, and leaves inference behavior unchanged. Experiments across multiple benchmarks, LLM families (e.g., Llama, Qwen), and advantage-estimation variants show that TROLL consistently outperforms PPO clipping baselines, yielding faster convergence, more stable training, and higher final task success rates.

📝 Abstract
On-policy Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints. The projection operates on a sparse subset of the model's most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language Models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model's inference behavior. Across datasets, model families, and advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.
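The sparse, KL-constrained logit projection described in the abstract can be sketched as follows. This is not the paper's exact projection layer: the interpolation-plus-bisection scheme, the top-k selection rule, and all names (`trust_region_project`, `delta`, `k`) are illustrative assumptions; the paper's method is additionally differentiable, which a bisection loop is not.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # KL(p || q) for dense probability vectors
    return float(np.sum(p * (np.log(p) - np.log(q))))

def trust_region_project(new_logits, ref_logits, delta=0.05, k=8, iters=30):
    """Sketch of a sparse trust-region projection: blend the new logits
    toward the reference on only the k largest reference logits, and
    bisect on the blend weight so that KL(proj || ref) <= delta."""
    proj = ref_logits.copy()                    # alpha = 0 is always feasible
    top = np.argsort(ref_logits)[-k:]           # sparse subset of token logits
    lo, hi = 0.0, 1.0                           # weight on the new policy
    for _ in range(iters):
        alpha = 0.5 * (lo + hi)
        cand = ref_logits.copy()
        cand[top] = (1 - alpha) * ref_logits[top] + alpha * new_logits[top]
        if kl(softmax(cand), softmax(ref_logits)) <= delta:
            lo, proj = alpha, cand              # feasible: keep moving toward new
        else:
            hi = alpha                          # infeasible: back off
    return proj
```

Note that, as in the paper, a projection like this is applied only during training; inference uses the unprojected model, so sampling behavior is unchanged.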
Problem

Research questions and friction points this paper is trying to address.

Replacing PPO clipping with trust region projection for stable training
Addressing suboptimal performance caused by crude KL approximations
Implementing token-level KL constraints while maintaining inference efficiency
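For reference, the "crude KL approximation" the bullets above refer to is the standard PPO clipped surrogate, which zeroes the gradient whenever the probability ratio leaves a fixed band instead of enforcing an explicit KL constraint. A minimal NumPy sketch (the function name and the `eps=0.2` default are illustrative):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate, the coarse trust-region proxy
    that TROLL replaces. Ratios outside [1-eps, 1+eps] are clipped,
    so those samples contribute no gradient through the ratio."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # pessimistic bound: take the smaller of the two surrogates
    return -np.mean(np.minimum(unclipped, clipped))
```

When the new and old log-probabilities coincide, the ratio is 1 everywhere and the loss reduces to the negative mean advantage; large ratios with positive advantage are capped at `1 + eps`.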
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces PPO clipping with discrete trust region projection
Applies token-level KL constraints on sparse logits
Maintains inference behavior while improving training performance