🤖 AI Summary
Cross-entropy loss ties multi-class classification and language modeling to a fixed choice of divergence (KL) and a uniform reference measure, limiting flexibility. Method: this paper proposes a loss–operator co-design framework grounded in *f*-divergences: (1) it introduces a family of convex loss functions that generalize the logistic loss by replacing the KL divergence with *f*-divergences and by allowing non-uniform reference measures; (2) it associates each such loss with a corresponding *f*-softargmax operator, generalizing the classical softargmax; (3) it develops an efficient parallelizable bisection algorithm for computing the *f*-softargmax of any *f*-divergence. Results: the α-divergence loss with α = 1.5 performs well relative to standard cross-entropy across pretraining, supervised fine-tuning, and knowledge distillation on several language-modeling benchmarks, supporting the practicality of divergence-based generalizations of the logistic loss.
📝 Abstract
The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language modeling. It is associated with the Kullback--Leibler (KL) divergence and the softargmax operator. In this work, we propose to construct new convex loss functions based on $f$-divergences. Our loss functions generalize the logistic loss in two directions: i) by replacing the KL divergence with $f$-divergences and ii) by allowing non-uniform reference measures. We instantiate our framework for numerous $f$-divergences, recovering existing losses and creating new ones. By analogy with the logistic loss, the loss function generated by an $f$-divergence is associated with an operator, which we dub $f$-softargmax. We derive a novel parallelizable bisection algorithm for computing the $f$-softargmax associated with any $f$-divergence. On the empirical side, one of the goals of this paper is to determine the effectiveness of loss functions beyond the classical cross-entropy in a language model setting, including pre-training, post-training (SFT) and distillation. We show that the loss function generated by the $\alpha$-divergence (which is equivalent to Tsallis $\alpha$-negentropy in the case of unit reference measures) with $\alpha=1.5$ performs well across several tasks.
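To make the $f$-softargmax concrete, here is a minimal illustrative sketch (not the paper's actual algorithm, whose parallelization details are not reproduced here). It assumes the $f$-softargmax solves $\max_{p \in \Delta} \langle \theta, p \rangle - \sum_i q_i f(p_i/q_i)$, whose KKT conditions give $p_i = q_i \,[(f')^{-1}(\theta_i - \tau)]_+$ for a scalar $\tau$ chosen by bisection so that $\sum_i p_i = 1$. The names `f_softargmax` and `f_prime_inv` are hypothetical, introduced only for this sketch:

```python
import math

def f_softargmax(scores, q, f_prime_inv, iters=100):
    """Illustrative f-softargmax via scalar bisection on the dual variable tau.

    scores      : list of logits theta_i
    q           : list of reference-measure weights q_i (positive)
    f_prime_inv : inverse of f', the derivative of the convex generator f
    """
    def total(tau):
        # Candidate probabilities for a given tau; clipped at 0 for feasibility.
        return sum(qi * max(f_prime_inv(si - tau), 0.0)
                   for si, qi in zip(scores, q))

    # total(tau) is non-increasing in tau, so bisection applies.
    lo, hi = min(scores) - 50.0, max(scores) + 50.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid   # mass too large -> increase tau
        else:
            hi = mid   # mass too small -> decrease tau
    tau = 0.5 * (lo + hi)
    return [qi * max(f_prime_inv(si - tau), 0.0) for si, qi in zip(scores, q)]
```

With the KL generator ($f'(t) = \log t$, so $(f')^{-1} = \exp$) and unit reference measures, this recovers the ordinary softargmax; a Tsallis-style generator, e.g. `lambda u: max(1.0 + 0.5 * u, 0.0) ** 2.0` for $\alpha = 1.5$, instead yields sparse outputs in which low-scoring entries can be exactly zero.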