🤖 AI Summary
This work addresses the lack of systematic theoretical analysis for the Softmax loss and its approximations in large-scale classification and ranking tasks. The authors propose a unified framework grounded in Fenchel-Young loss theory, establishing consistency criteria for the Softmax family of losses, modeling their gradient convergence dynamics, and providing a bias-variance decomposition alongside an iteration-complexity characterization for approximation methods. Through comprehensive theoretical analysis and large-scale experiments, the study reveals a strong alignment among consistency, convergence behavior, and empirical performance. It further offers the first explicit characterization of the trade-off between computational efficiency and model performance, thereby delivering both theoretical foundations and practical guidelines for selecting loss functions in large-scale machine learning.
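To make the Fenchel-Young grounding concrete, the standard formulation of that loss family (notation here follows the common convention in the literature, not necessarily the paper's own) and its specialization to Softmax can be sketched as:

```latex
% Fenchel-Young loss for a regularizer \Omega with convex conjugate \Omega^*:
%   L_\Omega(\theta; y) = \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle.
%
% Taking \Omega(p) = \sum_i p_i \log p_i on the probability simplex gives
%   \Omega^*(\theta) = \log \sum_j e^{\theta_j},
% so for a one-hot label y (where \Omega(y) = 0) the loss reduces to
%   L(\theta; y) = \log \sum_j e^{\theta_j} - \theta_y,
% i.e. the Softmax (cross-entropy) loss as a canonical instance of the family.
```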
📝 Abstract
The Softmax loss is one of the most widely employed surrogate objectives for classification and ranking tasks. The Fenchel-Young framework elucidates its theoretical properties by situating it as a canonical instance within a broad family of surrogates. Concurrently, another line of research has addressed scalability when the number of classes is exceedingly large, proposing numerous approximations that retain the benefits of the exact objective while improving efficiency. Building on these two perspectives, we present a principled investigation of the Softmax-family losses. We examine whether different surrogates achieve consistency with classification and ranking metrics, and analyze their gradient dynamics to reveal distinct convergence behaviors. We also introduce a systematic bias-variance decomposition for approximate methods that provides convergence guarantees, and further derive a per-epoch complexity analysis, showing explicit trade-offs between effectiveness and efficiency. Extensive experiments on a representative task demonstrate a strong alignment between consistency, convergence, and empirical performance. Together, these results establish a principled foundation and offer practical guidance for loss selection in large-class machine learning applications.
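The efficiency-vs-effectiveness trade-off for approximate methods can be illustrated with a minimal sketch: an exact Softmax cross-entropy over all classes next to a sampled-Softmax estimate that only touches the target plus `k` uniformly drawn negatives, importance-corrected so the partition function is estimated without bias. This is a generic illustration of the approximation idea, not the specific estimators analyzed in the paper; the function names and the uniform proposal are assumptions of this sketch.

```python
import numpy as np

def full_softmax_loss(logits, target):
    """Exact Softmax cross-entropy: stable log-sum-exp over all n classes (O(n))."""
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum()) - logits[target]

def sampled_softmax_loss(logits, target, k, rng):
    """Sampled-Softmax sketch (O(k)): estimate the partition function from k
    uniformly drawn negative classes, corrected by log(q_j * k) with q_j = 1/(n-1)."""
    n = logits.shape[0]
    negatives = rng.choice(np.delete(np.arange(n), target), size=k, replace=False)
    corrected = logits[negatives] - np.log(k / (n - 1))  # importance correction
    scores = np.concatenate(([logits[target]], corrected))
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum()) - logits[target]

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)                       # scores for 1000 classes
exact = full_softmax_loss(logits, target=3)          # sums over all 1000 classes
approx = sampled_softmax_loss(logits, target=3, k=64, rng=rng)  # touches only 65
```

Shrinking `k` lowers the per-step cost but inflates the variance of the partition estimate, which is exactly the kind of bias-variance and complexity trade-off the analysis makes explicit.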