Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

📅 2025-10-01

🤖 AI Summary

Problem: Supervised fine-tuning (SFT) often generalizes poorly. This work attributes the limitation to the objective SFT inherits from pretraining, negative log-likelihood (NLL), which can be suboptimal in post-training, where models already encode task-relevant priors and supervision can be long and noisy. Method: It introduces the “model-capability continuum” and systematically studies a family of probability-based objectives (e.g., −p, −p¹⁰, thresholded variants) along it. Experiments span 7 model backbones, 14 benchmarks, and 3 domains. Contribution/Results: Near the model-strong end, objectives that downweight low-probability tokens consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. The work challenges the one-size-fits-all use of NLL in SFT and offers a principled basis for adapting the objective to model capability.

📝 Abstract

Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm that can violate its optimality assumptions: models already encode task-relevant priors, and supervision can be long and noisy. To this end, we study a general family of probability-based objectives and characterize their effectiveness under different conditions. Through comprehensive experiments and extensive ablation studies across 7 model backbones, 14 benchmarks, and 3 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. Our code is available at https://github.com/GaotangLi/Beyond-Log-Likelihood.
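
To make the objective family concrete, here is a minimal pure-Python sketch of per-token losses for NLL, $-p$, and $-p^{10}$. It is an illustration, not the authors' implementation. The function names, the example probabilities, and in particular the form of the thresholded variant (tokens below a cutoff `tau` contribute zero loss) are assumptions; the abstract does not specify how thresholding is defined.

```python
import math

def nll(p: float) -> float:
    """Standard negative log likelihood for one target token."""
    return -math.log(p)

def neg_prob_power(p: float, k: float = 1.0) -> float:
    """Probability-based objective -p^k (k=1 gives -p, k=10 gives -p^10)."""
    return -(p ** k)

def thresholded_nll(p: float, tau: float = 0.1) -> float:
    """ASSUMED thresholded variant: tokens with p < tau contribute no loss.
    The source does not define its thresholded objectives."""
    return -math.log(p) if p >= tau else 0.0

def sequence_loss(token_probs, objective) -> float:
    """Average a per-token objective over the target tokens."""
    return sum(objective(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities a model assigns to three reference tokens;
# the last one is a low-probability (possibly noisy) token.
token_probs = [0.9, 0.8, 0.01]

losses = {
    "nll": sequence_loss(token_probs, nll),
    "-p": sequence_loss(token_probs, neg_prob_power),
    "-p^10": sequence_loss(token_probs, lambda p: neg_prob_power(p, 10.0)),
    "thresholded nll": sequence_loss(token_probs, thresholded_nll),
}

for name, value in losses.items():
    print(f"{name:>16}: {value:.4f}")
```

In this toy example the low-probability token accounts for most of the NLL value, while it barely moves $-p$ or $-p^{10}$ and is ignored entirely by the assumed thresholded variant.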

Problem

Research questions and friction points this paper is trying to address.

Supervised fine-tuning often shows limited generalization

Post-training can violate the optimality assumptions behind the negative log likelihood objective

No single objective performs best across the model capability continuum

Innovation

Methods, ideas, or system contributions that make the work stand out.

Probability-based objectives replace negative log likelihood

Prior-leaning methods downweight low-probability tokens

Objective selection adapts to model capability continuum
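
One way to see why these objectives downweight low-probability tokens is to compare their gradients using the chain rule. The notation here is assumed rather than taken from the source: $p = p_\theta(y_t \mid x, y_{<t})$ is the model's probability of a target token, and $k \ge 1$ is the exponent.

$$
\nabla_\theta(-\log p) = -\nabla_\theta \log p, \qquad
\nabla_\theta(-p) = -p\,\nabla_\theta \log p, \qquad
\nabla_\theta(-p^{k}) = -k\,p^{k}\,\nabla_\theta \log p
$$

Relative to NLL, each token's gradient is scaled by $p$ (for $-p$) or by $k\,p^{k}$ (for $-p^{k}$), so tokens the model already considers unlikely contribute little to the update. For example, with $k = 10$ and $p = 0.5$ the scale factor is $10 \cdot 0.5^{10} \approx 0.0098$, whereas NLL weights every token by 1 regardless of $p$.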
