Trust-Region Adaptive Policy Optimization

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the reasoning performance bottleneck in LLMs arising from imitation rigidity, suppressed exploration, and capability forgetting during two-stage training (SFT → RL), this paper proposes a dynamic interleaved unified optimization framework. Methodologically, it integrates supervised fine-tuning (SFT), reinforcement learning (RL), trust-region optimization, and KL-divergence control. Its core innovations are: (1) Trust-Region Supervised Fine-Tuning (TrSFT), which ensures stable parameter updates under forward KL constraints and adaptively switches to reverse KL for out-of-distribution samples; and (2) a utility-driven dynamic expert prefix selection mechanism that schedules SFT or RL optimization paths token-wise based on input prefixes. Evaluated on five mathematical reasoning benchmarks, the method consistently outperforms standard SFT, RL, SFT-then-RL, and current state-of-the-art approaches, delivering significant gains in complex reasoning capabilities.
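The TrSFT idea described above (standard forward-KL updates inside a trust region, attenuated updates outside it) can be illustrated with a minimal per-token sketch. This is not the paper's implementation; the threshold value, the linearized out-of-region branch, and the function name are all assumptions made for illustration.

```python
import math

def trsft_token_loss(p_expert_token: float, trust_threshold: float = 0.1) -> float:
    """Hypothetical sketch of a trust-region SFT token loss.

    p_expert_token: probability the model assigns to the expert token.
    Inside the trust region (probability >= trust_threshold), apply the
    standard forward-KL / cross-entropy term -log p. Outside it, replace
    the loss with its linearization at the trust-region boundary, so the
    gradient magnitude is capped and far-off-distribution tokens cannot
    produce destabilizing updates (a mode-seeking, reverse-KL-like effect).
    """
    if p_expert_token >= trust_threshold:
        return -math.log(p_expert_token)
    # Attenuated branch: tangent line of -log p at p = trust_threshold.
    boundary = -math.log(trust_threshold)
    slope = 1.0 / trust_threshold  # |d/dp (-log p)| at the boundary
    return boundary + slope * (trust_threshold - p_expert_token)
```

Note that the loss is continuous at the boundary and stays finite even as the token probability approaches zero, whereas the plain cross-entropy term would diverge.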

📝 Abstract
Post-training methods, especially Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), play an important role in improving large language models' (LLMs) complex reasoning abilities. However, the dominant two-stage pipeline (SFT then RL) suffers from a key inconsistency: SFT enforces rigid imitation that suppresses exploration and induces forgetting, limiting RL's potential for improvements. We address this inefficiency with TRAPO (**T**rust-**R**egion **A**daptive **P**olicy **O**ptimization), a hybrid framework that interleaves SFT and RL within each training instance by optimizing SFT loss on expert prefixes and RL loss on the model's own completions, unifying external supervision and self-exploration. To stabilize training, we introduce Trust-Region SFT (TrSFT), which minimizes forward KL divergence inside a trust region but attenuates optimization outside, effectively shifting toward reverse KL and yielding stable, mode-seeking updates favorable for RL. An adaptive prefix-selection mechanism further allocates expert guidance based on measured utility. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines, as well as recent state-of-the-art approaches, establishing a strong new paradigm for reasoning-enhanced LLMs.
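The abstract's split of each training instance (SFT loss on the expert prefix, RL loss on the model's own completion) can be sketched as a single hybrid objective. This is a simplified illustration under assumptions, not the paper's actual loss: the weighting scheme, the mean-over-tokens normalization, and a scalar advantage acting as the RL signal are all choices made here for clarity.

```python
def hybrid_loss(prefix_logps, completion_logps, advantage, sft_weight=1.0):
    """Hypothetical sketch of a per-instance SFT+RL hybrid objective.

    prefix_logps:     model log-probs of the expert-prefix tokens (SFT part).
    completion_logps: model log-probs of the tokens the model generated
                      itself (RL part).
    advantage:        scalar reward advantage of the completion; a standard
                      policy-gradient surrogate multiplies it by the
                      completion log-likelihood.
    """
    # Supervised term: negative log-likelihood of the expert prefix.
    sft_term = -sum(prefix_logps) / max(len(prefix_logps), 1)
    # RL term: policy-gradient surrogate on the model's own completion.
    rl_term = -advantage * sum(completion_logps) / max(len(completion_logps), 1)
    return sft_weight * sft_term + rl_term
```

With this framing, a zero advantage reduces the instance to pure SFT on the prefix, while an empty prefix reduces it to pure RL, which is how the two signals interleave within one example.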
Problem

Research questions and friction points this paper is trying to address.

Addresses inconsistency between supervised fine-tuning and reinforcement learning in LLMs
Unifies external supervision and self-exploration within single training instances
Stabilizes training with trust-region optimization to prevent forgetting during exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

TRAPO interleaves SFT and RL within each training instance
Trust-Region SFT stabilizes training with mode-seeking updates
Adaptive prefix-selection allocates expert guidance based on utility
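The utility-driven prefix selection named above can be illustrated with a minimal sketch: among candidate expert-prefix lengths, pick the one whose measured utility is highest. How the paper defines and measures utility is not specified here; the dict-based interface and the "zero utility means no expert guidance" convention are assumptions for illustration.

```python
def select_prefix_length(utilities):
    """Hypothetical sketch of utility-driven expert-prefix selection.

    utilities: dict mapping candidate prefix lengths (in tokens) to a
    measured utility score, e.g. the solve-rate gain observed when
    training continues from that expert prefix. Returns the length with
    the highest utility; if no candidate beats zero, no expert guidance
    is allocated (length 0, i.e. pure RL on the model's own rollout).
    """
    best_len, best_util = 0, 0.0
    for length, util in utilities.items():
        if util > best_util:
            best_len, best_util = length, util
    return best_len
```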
👥 Authors
Mingyu Su — The Conversational AI (CoAI) Group, Tsinghua University
Jian Guan — Ant Group
Yuxian Gu — Tsinghua University (Natural Language Processing)
Minlie Huang — The Conversational AI (CoAI) Group, Tsinghua University
Hongning Wang — Associate Professor, Department of Computer Science and Technology, Tsinghua University (Machine Learning, Information Retrieval, Large Language Models)