🤖 AI Summary
This work addresses the instability of existing GRPO-based reinforcement-learning fine-tuning methods for large language models, which rely on heuristic trust-region approximations and fail to constrain samples whose importance ratios exceed the clipping bounds. To overcome this limitation, the authors propose QUATRO, a principled optimization approach that enforces explicit trust-region constraints and introduces a query-adaptive mechanism. This mechanism derives intrinsic stabilization terms directly from the exact trust-region formulation, enabling controlled policy updates and stable entropy regulation. Experiments across multiple mathematical reasoning benchmarks demonstrate that QUATRO remains stable even under high policy staleness and large step sizes, keeps policy entropy well controlled, and improves both optimization robustness and task performance.
📝 Abstract
GRPO-style reinforcement learning (RL) algorithms for LLM fine-tuning have recently gained popularity. Because they rely on heuristic trust-region approximations, however, they can exhibit brittle optimization behavior: global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region policy Optimization (QUATRO), which directly enforces trust-region constraints through principled optimization. This yields a clear, interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with stabilizer terms arising intrinsically from the exact trust-region formulation. Evaluated on diverse mathematical reasoning benchmarks, QUATRO trains stably under increased policy staleness and aggressive learning rates while maintaining well-controlled entropy throughout training.
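To make the failure mode concrete, here is a minimal sketch of the standard PPO/GRPO-style clipped surrogate objective that the abstract critiques (this is the baseline heuristic, not QUATRO; the function name, `eps=0.2`, and the scalar inputs are illustrative assumptions):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO/GRPO-style clipped surrogate for a single sample.

    Once the importance ratio moves past the clip boundary in the
    direction favored by the advantage, min() selects the clipped
    (constant) branch, so the sample contributes zero gradient --
    i.e., the objective no longer regulates it, which is the
    brittleness the abstract attributes to heuristic clipping.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.minimum(unclipped, clipped))

# A sample with positive advantage whose ratio already exceeds 1 + eps
# is frozen at the boundary value: pushing the ratio further changes
# nothing, so no corrective pressure is applied.
print(clipped_surrogate(1.5, 1.0))  # 1.2 (clipped at 1 + eps)
print(clipped_surrogate(1.6, 1.0))  # 1.2 again: the objective is flat here
```

An exact trust-region formulation, by contrast, would keep such out-of-range samples inside a hard constraint rather than silently zeroing their gradient.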