🤖 AI Summary
Supervised fine-tuning (SFT) often generalizes poorly, largely because it inherits the pretraining objective, negative log-likelihood (NLL). NLL can be suboptimal in post-training, where models already encode task-relevant priors and supervision is often long and noisy. Method: We introduce the “model capability continuum” and systematically investigate a family of probability-based objectives (e.g., −p, −p¹⁰, thresholded variants) across capability stages. Experiments span 7 backbone models, 14 benchmarks, and 3 domains. Contribution/Results: We find that stronger models consistently benefit from prior-leaning objectives that downweight low-probability tokens, weaker models remain best served by standard NLL, and in between no single objective prevails. This work challenges the one-size-fits-all NLL paradigm in SFT and supports choosing the objective according to model capability.
📝 Abstract
Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log-likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different regime that can violate NLL's optimality assumptions: models already encode task-relevant priors, and supervision can be long and noisy. We therefore study a general family of probability-based objectives and characterize their effectiveness under different conditions. Through comprehensive experiments and extensive ablation studies across 7 model backbones, 14 benchmarks, and 3 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. Our code is available at https://github.com/GaotangLi/Beyond-Log-Likelihood.
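To make "downweight low-probability tokens" concrete, the sketch below compares per-token NLL with the objectives $-p$ and $-p^{10}$ named in the abstract. It relies only on the chain-rule identity $\nabla(-p^k) = k\,p^k\,\nabla(-\log p)$, so the gradient of $-p^k$ is the NLL gradient scaled by $k\,p^k$, which shrinks toward zero for tokens the model considers unlikely. All function names here are illustrative, and `thresholded_nll` is an assumed form of a thresholded variant (zero loss below a cutoff `tau`); the abstract does not define it, and this is not taken from the authors' repository.

```python
import math

def nll(p):
    """Standard negative log-likelihood of a target token with probability p."""
    return -math.log(p)

def neg_prob_power(p, k=1):
    """Prior-leaning objective -p**k: k=1 gives -p, k=10 gives -p^10."""
    return -(p ** k)

def thresholded_nll(p, tau=0.1):
    """ASSUMED thresholded variant: tokens with p < tau contribute no loss."""
    return -math.log(p) if p >= tau else 0.0

def weight_relative_to_nll(p, k=1):
    """Scale factor between the gradient of -p**k and the NLL gradient.

    d(-p^k) = -k p^(k-1) dp  and  d(-log p) = -(1/p) dp,
    hence d(-p^k) = (k * p^k) * d(-log p).
    """
    return k * p ** k

def sequence_loss(token_probs, objective, **kwargs):
    """Mean per-token loss over the target-token probabilities of one sequence."""
    return sum(objective(p, **kwargs) for p in token_probs) / len(token_probs)

if __name__ == "__main__":
    # Hypothetical target-token probabilities: confident, uncertain, very unlikely.
    for p in (0.9, 0.5, 0.01):
        print(
            f"p={p:<5} nll={nll(p):.3f} "
            f"weight(-p)={weight_relative_to_nll(p):.4f} "
            f"weight(-p^10)={weight_relative_to_nll(p, k=10):.2e}"
        )
```

Under NLL, the token with p = 0.01 produces the largest loss. Under $-p$ and $-p^{10}$ its gradient is scaled down by the factor above, so such tokens (possibly noisy supervision) contribute much less to the update.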