Step-wise Adaptive Integration of Supervised Fine-tuning and Reinforcement Learning for Task-Specific LLMs

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Supervised fine-tuning (SFT) of large language models (LLMs) for mathematical and logical reasoning is prone to overfitting, while reinforcement learning (RL) approaches, particularly reward-based ones, often suffer from mode collapse. Method: This paper proposes SASR, a step-wise adaptive hybrid training framework inspired by curriculum learning and an exam-style evaluation mechanism. SASR introduces a dual-metric dynamic weighting strategy, driven by gradient norm and divergence from the original distribution, to coordinate SFT and the online RL method Group Relative Policy Optimization (GRPO) in real time with theoretical grounding, abandoning static phase-switching paradigms. Contribution/Results: SASR significantly reduces reliance on high-quality human annotations and enhances cross-task generalization. Empirical evaluation demonstrates that SASR consistently outperforms pure SFT, pure RL, and existing static hybrid methods across diverse reasoning benchmarks, maintaining high answer accuracy while substantially improving reasoning-path diversity and robustness.

📝 Abstract
Large language models (LLMs) excel at mathematical reasoning and logical problem-solving. The current popular training paradigms primarily use supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance the models' reasoning abilities. However, when using SFT or RL alone, there are respective challenges: SFT may suffer from overfitting, while RL is prone to mode collapse. State-of-the-art methods have proposed hybrid training schemes, but static switching faces challenges such as poor generalization across different tasks and high dependence on data quality. In response to these challenges, inspired by the curriculum learning-quiz mechanism in human reasoning cultivation, we propose SASR, a step-wise adaptive hybrid training framework that theoretically unifies SFT and RL and dynamically balances the two throughout optimization. SASR uses SFT for an initial warm-up to establish basic reasoning skills, and then uses an adaptive dynamic adjustment algorithm based on gradient norm and divergence relative to the original distribution to seamlessly integrate SFT with the online RL method GRPO. By monitoring the training status of LLMs and adjusting the training process in sequence, SASR ensures a smooth transition between training schemes, maintaining core reasoning abilities while exploring different paths. Experimental results demonstrate that SASR outperforms SFT, RL, and static hybrid training methods.
Problem

Research questions and friction points this paper is trying to address.

Overfitting in supervised fine-tuning (SFT) for LLMs
Mode collapse in reinforcement learning (RL) for LLMs
Poor generalization in static hybrid training methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-wise adaptive hybrid training framework
Dynamic balance between SFT and RL
Adaptive adjustment algorithm based on gradient norm and distributional divergence
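The adaptive balance described above can be sketched as a per-step convex combination of the SFT and GRPO losses, with the mixing weight driven by the two monitoring signals. This is an illustrative assumption, not the paper's exact formulation: the function names, the sigmoid gating, and the scale constants below are hypothetical.

```python
import math

def adaptive_sft_weight(grad_norm: float, kl_div: float,
                        g_scale: float = 1.0, k_scale: float = 1.0) -> float:
    """Map the two monitoring signals to an SFT weight in (0, 1).

    A large gradient norm or a large divergence from the original
    distribution suggests unstable training, so the weight shifts
    toward SFT; small values shift it toward GRPO-style RL.
    """
    score = g_scale * grad_norm + k_scale * kl_div
    return 1.0 / (1.0 + math.exp(-(score - 1.0)))  # sigmoid gate

def hybrid_loss(sft_loss: float, grpo_loss: float,
                grad_norm: float, kl_div: float) -> float:
    """Convex combination of the two objectives at each training step."""
    alpha = adaptive_sft_weight(grad_norm, kl_div)
    return alpha * sft_loss + (1.0 - alpha) * grpo_loss
```

Because the weight is recomputed every step from the current training state, this yields the smooth SFT-to-RL transition the abstract describes rather than a hard phase switch.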
Jack Chen
Shanghai Goku Technologies Limited, Shanghai AllMind Artificial Intelligence Technology Co., Ltd.
Fazhong Liu
Shanghai Jiao Tong University
AI security
Naruto Liu
Shanghai Goku Technologies Limited, Shanghai AllMind Artificial Intelligence Technology Co., Ltd.
Yuhan Luo
Assistant Professor, City University of Hong Kong
Human-Computer Interaction, Health Informatics, Ubiquitous Computing, Personal Informatics
Erqu Qin
Shanghai Goku Technologies Limited, Shanghai AllMind Artificial Intelligence Technology Co., Ltd.
Harry Zheng
Shanghai Goku Technologies Limited, Shanghai AllMind Artificial Intelligence Technology Co., Ltd.
Tian Dong
Shanghai Jiao Tong University
Computer Security, Machine Learning
Haojin Zhu
Shanghai Jiao Tong University
Yan Meng
Shanghai Jiao Tong University
Xiao Wang
Shanghai Goku Technologies Limited, Shanghai AllMind Artificial Intelligence Technology Co., Ltd.