InfoSFT: Learn More and Forget Less with Information-Aware Token Weighting

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Standard supervised fine-tuning (SFT) fits all training samples uniformly, including those the base model assigns low likelihood, which often leads to overfitting on those examples and to catastrophic forgetting of pre-trained capabilities. To address this, this work proposes an information-aware token weighting strategy that emphasizes medium-confidence, highly informative tokens in the loss function and requires only a one-line code modification. The method outperforms standard SFT and likelihood-weighted baselines across mathematical reasoning, code generation, and chain-of-thought tasks, while better preserving the model’s pre-trained knowledge. By concentrating the learning signal on tokens that are neither already familiar to the base model nor so unlikely that they destabilize training, the approach strikes a better balance between acquiring new skills and retaining existing capabilities.

📝 Abstract

Supervised fine-tuning (SFT) provides the standard approach for teaching LLMs new behaviors from offline expert demonstrations. However, standard SFT uniformly fits all samples -- including those with low likelihood under the base model -- which can disproportionately drive training updates toward overfitting specific samples rather than learning the target behavior. Moreover, adapting to these unlikely samples induces substantial policy shifts that degrade prior capabilities. Existing methods mitigate this by filtering, regenerating, or down-weighting low-likelihood data. In doing so, they often suppress precisely the novel behaviors the base model has yet to learn. We propose InfoSFT, a principled weighting scheme for the SFT objective that concentrates learning signals on maximally informative, medium-confidence tokens -- those neither overly familiar to the base model nor so unlikely that they cause instability. Requiring only a one-line modification to the standard token-wise loss, InfoSFT demonstrably improves generalization over vanilla SFT and likelihood-weighted baselines across math, code, and chain-of-thought tasks and across diverse model families, while better preserving pre-existing capabilities.
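
To make the idea of a token-weighted SFT loss concrete, here is a minimal sketch in plain Python. The abstract does not state InfoSFT's actual weighting function, so the weight used here, `4 * p * (1 - p)`, is a hypothetical stand-in chosen only because it peaks at medium confidence (`p = 0.5`) and vanishes for tokens the model already predicts confidently or finds very unlikely. The function names `token_weight`, `vanilla_sft_loss`, and `weighted_sft_loss` are likewise illustrative and not taken from the paper.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def token_weight(p):
    """Hypothetical weight (NOT the paper's formula): peaks at p = 0.5,
    and goes to zero as p approaches 0 or 1."""
    return 4.0 * p * (1.0 - p)


def vanilla_sft_loss(token_probs):
    """Standard SFT: unweighted mean negative log-likelihood over target tokens."""
    nlls = [-math.log(max(p, EPS)) for p in token_probs]
    return sum(nlls) / len(nlls)


def weighted_sft_loss(token_probs, weight_fn=token_weight):
    """Token-weighted SFT: same as vanilla, except each token's NLL is
    multiplied by a weight computed from the model's own probability."""
    terms = [weight_fn(p) * -math.log(max(p, EPS)) for p in token_probs]
    return sum(terms) / len(terms)


# Probabilities the model assigns to three target tokens:
# one already familiar, one medium-confidence, one very unlikely.
probs = [0.99, 0.5, 0.01]

vanilla = vanilla_sft_loss(probs)
weighted = weighted_sft_loss(probs)

# Share of the total loss contributed by the very unlikely token.
unlikely_share_vanilla = -math.log(probs[2]) / (vanilla * len(probs))
unlikely_share_weighted = (
    token_weight(probs[2]) * -math.log(probs[2]) / (weighted * len(probs))
)

print(f"vanilla loss:  {vanilla:.4f}, unlikely-token share: {unlikely_share_vanilla:.2f}")
print(f"weighted loss: {weighted:.4f}, unlikely-token share: {unlikely_share_weighted:.2f}")
```

Under this stand-in weight, the very unlikely token dominates the vanilla loss but contributes a much smaller share of the weighted loss, while the medium-confidence token keeps its full contribution. In a real training loop the weight would presumably be computed from the model's probabilities without propagating gradients through it; that detail is also an assumption here, not something the abstract specifies.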

Problem

Research questions and friction points this paper is trying to address.

Supervised Fine-Tuning

Overfitting

Policy Shift

Low-Likelihood Samples

Capability Preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

InfoSFT

Information-Aware Weighting

Supervised Fine-Tuning