AI Summary
Standard supervised fine-tuning (SFT) often leads large reasoning models to become overconfident and exhibit reduced generation diversity, thereby limiting their exploration capacity during subsequent reinforcement learning (RL) stages. While conventional entropy regularization increases output entropy, it struggles to encourage meaningful exploration. To address this, the authors propose CurioSFT, a novel approach that integrates self-exploration distillation with an entropy-guided, temperature-adaptive mechanism. During SFT, CurioSFT performs token-level adaptive knowledge distillation, preserving factual accuracy while stimulating the model's intrinsic curiosity and enhancing exploration at reasoning-critical positions. Experimental results demonstrate that, on mathematical reasoning tasks, CurioSFT improves in-distribution and out-of-distribution performance by 2.5 and 2.9 points, respectively, during SFT, with an additional average gain of 5.0 points after transitioning to reinforcement learning.
Abstract
The standard post-training recipe for large reasoning models, supervised fine-tuning followed by reinforcement learning (SFT-then-RL), may limit the benefits of the RL stage: while SFT imitates expert demonstrations, it often causes overconfidence and reduces generation diversity, leaving RL with a narrowed solution space to explore. Adding entropy regularization during SFT is not a cure-all; it tends to flatten token distributions toward uniformity, increasing entropy without improving meaningful exploration capability. In this paper, we propose CurioSFT, an entropy-preserving SFT method designed to enhance exploration capabilities through intrinsic curiosity. It consists of (a) Self-Exploratory Distillation, which distills the model toward a self-generated, temperature-scaled teacher to encourage exploration within its capability; and (b) Entropy-Guided Temperature Selection, which adaptively adjusts distillation strength to mitigate knowledge forgetting, amplifying exploration at reasoning tokens while stabilizing factual tokens. Extensive experiments on mathematical reasoning tasks demonstrate that, in the SFT stage, CurioSFT outperforms vanilla SFT by 2.5 points on in-distribution tasks and 2.9 points on out-of-distribution tasks. We also verify that the exploration capabilities preserved during SFT translate into concrete gains in the RL stage, yielding an average improvement of 5.0 points.
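As described, CurioSFT distills each token toward a temperature-scaled copy of the model's own predictive distribution, with the temperature chosen per token from that distribution's entropy: hotter self-teachers at high-entropy reasoning tokens, near-unit temperature at low-entropy factual tokens. The sketch below illustrates one way such a loss could look; the linear temperature schedule, the mixing weight `alpha`, and the function name are our assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def curiosft_token_loss(logits, target_ids, t_min=1.0, t_max=2.0, alpha=0.5):
    """Sketch of an entropy-guided, temperature-adaptive self-distillation loss.

    logits:     (seq_len, vocab) model outputs for one sequence
    target_ids: (seq_len,) ground-truth next tokens (the SFT labels)
    NOTE: the temperature schedule and mixing weight are illustrative
    assumptions; the paper's actual objective may differ.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Per-token entropy, normalized to [0, 1] by its maximum, log(vocab).
    entropy = -(probs * log_probs).sum(-1)
    norm_ent = entropy / torch.log(torch.tensor(float(logits.size(-1))))

    # Entropy-guided temperature: high-entropy (reasoning) tokens get a
    # hotter self-teacher; low-entropy (factual) tokens stay near T = t_min.
    temp = t_min + (t_max - t_min) * norm_ent              # (seq_len,)

    # Self-exploratory teacher: the model's own distribution, temperature-
    # scaled and detached so it acts as a fixed soft target this step.
    teacher = F.softmax(logits.detach() / temp.unsqueeze(-1), dim=-1)

    # Token-level distillation (cross-entropy to the soft self-teacher)
    # combined with the standard SFT loss on the demonstration tokens.
    distill = -(teacher * log_probs).sum(-1).mean()
    sft = F.cross_entropy(logits, target_ids)
    return alpha * sft + (1.0 - alpha) * distill
```

The detached teacher keeps gradients flowing only through the student side of the distillation term, so the model is pulled toward a softened version of its own beliefs rather than a uniform distribution, which is how the abstract distinguishes this from plain entropy regularization.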