AI Summary
This work addresses a longstanding trade-off in classification and structured prediction between optimization efficiency and statistical consistency: smooth losses are easy to optimize but yield only slow square-root consistency bounds, while piecewise-linear losses offer fast linear consistency rates yet are not differentiable everywhere. The paper proposes Linear-Core (LC) surrogate losses, the first globally differentiable, convex loss functions that strictly satisfy linear H-consistency bounds. By stitching a linear core to smooth tails, LC losses remain differentiable everywhere, which in structured prediction permits an unbiased stochastic gradient estimator that circumvents the quadratic-complexity bottleneck of exact inference. Experiments show that LC achieves a 23× speedup over Structured SVM on large-vocabulary sequence labeling tasks and improves accuracy by 2.6% over cross-entropy on noisy CIFAR-10.
Abstract
The choice of loss function in classification involves a fundamental trade-off: smooth losses (like Cross-Entropy) enable fast optimization rates but yield slow square-root consistency bounds, while piecewise-linear losses (like Hinge) offer fast linear consistency rates but suffer from non-differentiability. We propose Linear-Core (LC) Surrogates, a new family of convex loss functions that resolve this tension by stitching a linear core to a smooth tail. We prove that these surrogates are differentiable everywhere while retaining strict linear $H$-consistency bounds, effectively combining the optimization benefits of smoothness with the statistical efficiency of margin-based losses. In the structured prediction setting, we show that this smoothness unlocks a massive computational and energy advantage: it allows for an unbiased stochastic gradient estimator that bypasses the quadratic complexity $O(|\mathscr{Y}|^2)$ of exact inference (e.g., Viterbi). Empirically, our method achieves a 23$\times$ speedup over Structured SVMs on large-vocabulary sequence tagging tasks and demonstrates superior robustness to instance-dependent label noise, outperforming Cross-Entropy by 2.6% on corrupted CIFAR-10.
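As a rough illustration of the stitching idea, the sketch below joins a linear piece to a smooth exponential tail so that both the value and the slope agree at the junction. This is not the paper's LC definition, which the abstract does not state: the function names, the junction at margin 0, and the choice of an exponential tail are all assumptions made for illustration only. The hinge loss is included to show the kink that the stitched loss avoids.

```python
import math


def stitched_loss(margin: float) -> float:
    """Hypothetical stitched loss (an assumption, not the paper's LC loss).

    Linear piece 1 - margin for margin <= 0, exponential tail exp(-margin)
    for margin > 0. Both pieces equal 1 with slope -1 at margin = 0.
    """
    if margin <= 0.0:
        return 1.0 - margin
    return math.exp(-margin)


def stitched_grad(margin: float) -> float:
    """Derivative of stitched_loss; continuous and nondecreasing, so the loss is convex."""
    if margin <= 0.0:
        return -1.0
    return -math.exp(-margin)


def hinge_loss(margin: float) -> float:
    """Standard hinge loss, shown for comparison; it has a kink at margin = 1."""
    return max(0.0, 1.0 - margin)


def one_sided_slopes(f, x: float, h: float = 1e-6):
    """Left and right finite-difference slopes of f at x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right


if __name__ == "__main__":
    # The stitched loss has matching one-sided slopes at its junction ...
    print("stitched slopes at 0:", one_sided_slopes(stitched_loss, 0.0))
    # ... while the hinge loss jumps from slope -1 to slope 0 at its kink.
    print("hinge slopes at 1:   ", one_sided_slopes(hinge_loss, 1.0))
```

Because the derivative is defined at every margin, a gradient-based optimizer never needs a subgradient rule for this loss, which is the property the abstract credits for enabling stochastic gradient estimation.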