🤖 AI Summary
Supervised fine-tuning (SFT) of large language models (LLMs) for mathematical and logical reasoning suffers from overfitting, while reinforcement learning (RL) approaches—particularly those based on reward modeling—often encounter mode collapse.
Method: This paper proposes SASR, a step-wise adaptive hybrid training framework that integrates curriculum learning with an exam-style evaluation mechanism. SASR introduces a dual-metric dynamic weighting strategy, driven by gradient norm and divergence from the original distribution, to coordinate SFT and Group Relative Policy Optimization (GRPO) in real time with theoretical grounding, replacing static phase-switching paradigms.
Contribution/Results: SASR substantially reduces reliance on high-quality human annotations and improves cross-task generalization. Across diverse reasoning benchmarks, SASR consistently outperforms pure SFT, pure RL, and existing static hybrid methods, maintaining high answer accuracy while markedly improving the diversity and robustness of reasoning paths.
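The dual-metric weighting described above can be sketched as a scalar rule that blends the SFT and GRPO losses. This is a minimal illustration under assumptions: the specific functions `adaptive_sft_weight` and `hybrid_loss`, and the exact form of the weighting rule, are hypothetical stand-ins, not the formula from the SASR paper.

```python
import math

def adaptive_sft_weight(grad_norm: float, kl_div: float,
                        grad_ref: float = 1.0) -> float:
    """Hypothetical dual-metric weight for the SFT loss term.

    A large gradient norm relative to a reference scale (unstable
    optimization) or a large KL divergence from the original policy
    (distributional drift) pushes the weight toward SFT; otherwise
    training leans on GRPO-style RL exploration. Illustrative only.
    """
    instability = grad_norm / (grad_norm + grad_ref)  # in [0, 1)
    drift = 1.0 - math.exp(-kl_div)                   # in [0, 1)
    return 0.5 * (instability + drift)

def hybrid_loss(sft_loss: float, grpo_loss: float,
                grad_norm: float, kl_div: float) -> float:
    """Convex combination of the SFT and GRPO losses under the
    adaptive weight, so the mix shifts continuously with the two
    monitoring signals instead of switching at a fixed phase boundary."""
    alpha = adaptive_sft_weight(grad_norm, kl_div)
    return alpha * sft_loss + (1.0 - alpha) * grpo_loss
```

With both signals at zero the weight collapses to pure GRPO; as either signal grows, the SFT term regains influence, which is the qualitative behavior the summary attributes to the dynamic weighting strategy.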
📝 Abstract
Large language models (LLMs) excel at mathematical reasoning and logical problem-solving. The prevailing training paradigms primarily use supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance models' reasoning abilities. Used alone, each faces its own challenge: SFT may suffer from overfitting, while RL is prone to mode collapse. State-of-the-art methods have proposed hybrid training schemes, but static switching generalizes poorly across tasks and depends heavily on data quality. To address these challenges, inspired by the curriculum-learning-and-quiz mechanism in human reasoning cultivation, we propose SASR, a step-wise adaptive hybrid training framework that theoretically unifies SFT and RL and dynamically balances the two throughout optimization. SASR first uses SFT as a warm-up to establish basic reasoning skills, then applies an adaptive dynamic adjustment algorithm, based on the gradient norm and the divergence from the original distribution, to seamlessly integrate SFT with the online RL method GRPO. By monitoring the training state of the LLM and adjusting the training process accordingly, SASR ensures a smooth transition between training schemes, maintaining core reasoning abilities while exploring diverse reasoning paths. Experimental results demonstrate that SASR outperforms SFT, RL, and static hybrid training methods.
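The two-stage schedule the abstract describes, an SFT warm-up followed by adaptive SFT/GRPO mixing, can be illustrated as a toy loop. All names here (`train_sasr_schedule`, the stubbed monitoring signals, the mixing rule) are hypothetical: a real implementation would read the gradient norm and divergence from the model during training rather than sampling random stand-ins.

```python
import random

def train_sasr_schedule(total_steps: int, warmup_steps: int,
                        seed: int = 0) -> list:
    """Toy schedule for SASR's two stages. Returns, per step, the
    weight alpha on the SFT loss (1.0 means pure SFT; smaller values
    mean more GRPO). The post-warm-up rule is an illustrative stand-in
    for the paper's adaptive adjustment algorithm."""
    rng = random.Random(seed)
    schedule = []
    for step in range(total_steps):
        if step < warmup_steps:
            # Stage 1: pure SFT warm-up to establish basic reasoning skills.
            alpha = 1.0
        else:
            # Stage 2: adapt alpha from monitored training state.
            grad_norm = rng.uniform(0.0, 2.0)  # stand-in: gradient norm
            kl_div = rng.uniform(0.0, 1.0)     # stand-in: drift from base policy
            alpha = min(1.0, 0.5 * grad_norm + 0.5 * kl_div)
        schedule.append(alpha)
    return schedule
```

Because alpha varies continuously after warm-up instead of jumping between discrete phases, the transition between training schemes stays smooth, which is the property SASR claims over static hybrid switching.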