🤖 AI Summary
This work addresses the instability of existing GRPO-based reinforcement-learning fine-tuning methods for large language models, which rely on heuristic trust-region approximations and fail to constrain samples whose importance ratios exceed the clipping bounds. To overcome this limitation, the authors propose QUATRO, a principled optimization approach that enforces explicit trust-region constraints and introduces a query-adaptive mechanism. This mechanism derives intrinsic stabilization terms directly from the exact trust-region formulation, enabling controlled policy updates and stable entropy regulation. Experiments across multiple mathematical reasoning benchmarks demonstrate that QUATRO remains stable even under high policy staleness and large step sizes, keeps policy entropy well controlled, and improves both optimization robustness and task performance.
📝 Abstract
GRPO-style reinforcement learning (RL) algorithms for LLM fine-tuning have recently gained popularity. Because they rely on heuristic trust-region approximations, however, they can exhibit brittle optimization behavior: global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region policy Optimization (QUATRO), which directly enforces trust-region constraints through principled optimization. This yields a clear, interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with stabilizer terms arising intrinsically from the exact trust-region formulation. Evaluated on diverse mathematical reasoning benchmarks, QUATRO trains stably under increased policy staleness and aggressive learning rates while maintaining well-controlled entropy throughout training.
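To make the failure mode concrete, here is a minimal sketch of the standard PPO/GRPO-style clipped surrogate objective that the abstract critiques (this is the baseline heuristic, not QUATRO; the function name, `eps=0.2`, and the scalar inputs are illustrative assumptions):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO/GRPO-style clipped surrogate for a single sample.

    Once the importance ratio moves past the clip boundary in the
    direction favored by the advantage, min() selects the clipped
    (constant) branch, so the sample contributes zero gradient --
    i.e., the objective no longer regulates it, which is the
    brittleness the abstract attributes to heuristic clipping.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.minimum(unclipped, clipped))

# A sample with positive advantage whose ratio already exceeds 1 + eps
# is frozen at the boundary value: pushing the ratio further changes
# nothing, so no corrective pressure is applied.
print(clipped_surrogate(1.5, 1.0))  # 1.2 (clipped at 1 + eps)
print(clipped_surrogate(1.6, 1.0))  # 1.2 again: the objective is flat here
```

An exact trust-region formulation, by contrast, would keep such out-of-range samples inside a hard constraint rather than silently zeroing their gradient.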