Towards Understanding Why Label Smoothing Degrades Selective Classification and How to Fix It

📅 2024-03-19
📈 Citations: 2
Influential: 0
🤖 AI Summary
Label smoothing (LS), while improving classification accuracy, systematically degrades selective classification (SC) by distorting logit-level gradients and weakening the uncertainty ordering between correct and incorrect predictions. Method: the paper provides the first gradient-level analysis of LS's adverse impact on SC and proposes a lightweight, interpretable post-hoc logit normalization technique that operates solely at inference time, without modifying training. Contribution/Results: the method recovers, and can even surpass, the SC performance lost to LS. Evaluation across diverse architectures (ResNet, ViT) and large-scale benchmarks (ImageNet, CIFAR-100) confirms that LS consistently harms SC, while the proposed normalization not only mitigates this degradation but achieves superior coverage–risk trade-offs, offering a pathway toward more trustworthy machine learning.

📝 Abstract
Label smoothing (LS) is a popular regularisation method for training neural networks as it is effective in improving test accuracy and is simple to implement. "Hard" one-hot labels are "smoothed" by uniformly distributing probability mass to other classes, reducing overfitting. Prior work has suggested that in some cases LS can degrade selective classification (SC) -- where the aim is to reject misclassifications using a model's uncertainty. In this work, we first demonstrate empirically across an extended range of large-scale tasks and architectures that LS consistently degrades SC. We then address a gap in existing knowledge, providing an explanation for this behaviour by analysing logit-level gradients: LS degrades the uncertainty rank ordering of correct vs incorrect predictions by suppressing the max logit more when a prediction is likely to be correct, and less when it is likely to be wrong. This elucidates previously reported experimental results where strong classifiers underperform in SC. We then demonstrate the empirical effectiveness of post-hoc logit normalisation for recovering lost SC performance caused by LS. Furthermore, linking back to our gradient analysis, we again provide an explanation for why such normalisation is effective.
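The smoothing of hard one-hot labels described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the true class keeps probability 1 - eps and the eps mass is spread uniformly over the remaining classes, matching the abstract's "uniformly distributing probability mass to other classes"; the function name and `eps` value are illustrative.

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Turn hard integer labels into smoothed one-hot targets.

    The true class keeps probability 1 - eps; the eps probability
    mass is spread uniformly over the other num_classes - 1 classes.
    """
    one_hot = np.eye(num_classes)[labels]
    return one_hot * (1.0 - eps) + (1.0 - one_hot) * eps / (num_classes - 1)

# Example: two samples, three classes, eps = 0.1
targets = smooth_labels(np.array([0, 2]), num_classes=3, eps=0.1)
# each row still sums to 1, e.g. [0.9, 0.05, 0.05] for class 0
```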
Problem

Research questions and friction points this paper is trying to address.

Why does label smoothing degrade selective classification?
Can analysing logit-level gradients explain this degradation?
Can post-hoc logit normalisation recover the lost SC performance?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Empirical demonstration that label smoothing consistently degrades selective classification across large-scale tasks and architectures
Gradient-level analysis showing LS suppresses the max logit more for likely-correct predictions, degrading uncertainty rank ordering
Post-hoc logit normalisation that recovers the SC performance lost to LS
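A minimal sketch of what post-hoc logit normalisation for selective classification might look like, assuming the logit vector is divided by its L2 norm before taking the max as the confidence score. The function names, the choice of norm, and the threshold are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def normalized_max_logit(logits, eps=1e-12):
    """Confidence score: max logit after normalising each logit
    vector by its L2 norm (illustrative instantiation of post-hoc
    logit normalisation; the paper may use a different norm)."""
    norms = np.linalg.norm(logits, axis=-1, keepdims=True)
    return (logits / (norms + eps)).max(axis=-1)

def selective_classify(logits, threshold):
    """Predict the argmax class; abstain when confidence < threshold.

    Returns (predictions, accept_mask); rejected samples are the
    ones the selective classifier declines to label.
    """
    conf = normalized_max_logit(logits)
    preds = logits.argmax(axis=-1)
    return preds, conf >= threshold

# A confidently peaked logit vector is accepted; a flat one is rejected.
logits = np.array([[5.0, 0.0, 0.0],
                   [1.0, 0.9, 0.8]])
preds, accept = selective_classify(logits, threshold=0.8)
```

Normalising removes the overall logit magnitude, which the paper's gradient analysis identifies as the component LS distorts, leaving only the direction of the logit vector to drive the uncertainty ranking.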