🤖 AI Summary
This work addresses a key limitation of conventional supervised fine-tuning (SFT): by enforcing strict alignment with a single reference response, it overlooks the one-to-many nature of language and often overfits to non-essential surface forms. The authors observe that high-probability tokens typically encode core semantic content, whereas low-probability tokens often correspond to interchangeable stylistic or phrasal variants. Building on this insight, they propose ProFit, a probability-guided token selection mechanism that retains only high-probability, semantically critical tokens in the SFT loss while masking out low-probability, replaceable ones. This mitigates the overfitting induced by single-reference supervision and improves generalization without the computational overhead of generating diverse responses; ProFit significantly outperforms standard SFT on both general reasoning and mathematical benchmarks.
📝 Abstract
Supervised fine-tuning (SFT) is a fundamental post-training strategy for aligning Large Language Models (LLMs) with human intent. However, traditional SFT often ignores the one-to-many nature of language by forcing alignment with a single reference answer, causing the model to overfit to non-core expressions. Although our empirical analysis suggests that introducing multiple reference answers can mitigate this issue, the prohibitive data and computational costs necessitate a strategic shift: prioritizing the mitigation of single-reference overfitting over the costly pursuit of answer diversity. To this end, we reveal an intrinsic connection between token probability and semantic importance: high-probability tokens carry the core logical framework, while low-probability tokens are mostly replaceable expressions. Based on this insight, we propose ProFit, which selectively masks low-probability tokens during fine-tuning to prevent surface-level overfitting. Extensive experiments confirm that ProFit consistently outperforms traditional SFT baselines on general reasoning and mathematical benchmarks.
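To make the masking idea concrete, here is a minimal sketch of a ProFit-style loss in PyTorch. It is an illustration under stated assumptions, not the paper's implementation: the function name `profit_loss`, the quantile-based token selection, and the `keep_ratio` hyperparameter are all hypothetical, and the paper may identify high-probability tokens differently.

```python
import torch
import torch.nn.functional as F

def profit_loss(logits, labels, keep_ratio=0.7, ignore_index=-100):
    """Cross-entropy over only the high-probability reference tokens.

    Tokens whose reference label receives low probability from the
    current model are dropped from the loss, treating them as
    replaceable surface forms rather than core semantic content.
    `keep_ratio` (the fraction of tokens kept) is a hypothetical knob.
    """
    # Shift so position t predicts token t+1, as in causal LM training.
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()

    # Log-probability the model assigns to each reference token.
    log_probs = F.log_softmax(logits.float(), dim=-1)
    token_logp = log_probs.gather(
        -1, labels.clamp(min=0).unsqueeze(-1)
    ).squeeze(-1)

    valid = labels != ignore_index
    # Threshold at the (1 - keep_ratio) quantile of valid token
    # log-probs, so roughly keep_ratio of the tokens survive the mask.
    threshold = torch.quantile(token_logp[valid], 1.0 - keep_ratio)
    keep = valid & (token_logp >= threshold)

    # Token-level NLL averaged over the kept tokens only; the masked
    # (low-probability) tokens contribute no gradient.
    nll = -token_logp
    return (nll * keep).sum() / keep.sum().clamp(min=1)
```

In an ordinary training step, `profit_loss(model(input_ids).logits, labels)` would simply replace the standard cross-entropy; the rest of the optimizer loop is unchanged. Note that this sketch thresholds per batch under the current model's own probabilities; a fixed probability cutoff or probabilities from a frozen reference model would be equally plausible variants.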