Retraining with Predicted Hard Labels Provably Increases Model Accuracy

📅 2024-06-17
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses whether retraining a model on its own predicted hard labels can improve generalization accuracy when the given labels are randomly corrupted. The authors provide the first theoretical proof that, in a linearly separable binary classification setting, retraining with the hard labels predicted by the initially trained model can improve population accuracy over training once on the noisy labels. They further design a consensus-based retraining strategy that retains only samples for which the model's predicted label agrees with the given (noisy) label, incurring no extra privacy cost. Integrated into a label differential privacy (DP) training framework, this approach achieves over a 6% accuracy gain with ResNet-18 on CIFAR-100 under ε = 3 label DP, strengthening the privacy–utility trade-off under label noise.

📝 Abstract
The performance of a model trained with noisy labels is often improved by simply retraining the model with its own predicted hard labels (i.e., 1/0 labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable binary classification setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with local label differential privacy (DP) which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at no extra privacy cost; we call this consensus-based retraining. As an example, when training ResNet-18 on CIFAR-100 with ε = 3 label DP, we obtain more than 6% improvement in accuracy with consensus-based retraining.
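The retraining procedure the abstract describes can be sketched on synthetic data. Everything below is an illustrative assumption, not the paper's setup: the data generation, the plain logistic-regression classifier `fit_linear`, and all hyperparameters are stand-ins for the paper's linearly separable analysis setting (the experiments use ResNet-18 on CIFAR-100 with label DP noise).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linearly separable binary data with randomly flipped labels
# (an illustrative stand-in for the noisy labels "given to us").
n, d, flip_prob = 2000, 10, 0.2
w_star = rng.normal(size=d)                  # ground-truth separator
X = rng.normal(size=(n, d))
y_clean = np.sign(X @ w_star)                # labels in {-1, +1}
flip = rng.random(n) < flip_prob
y_noisy = np.where(flip, -y_clean, y_clean)  # labels the learner sees

def fit_linear(X, y, lr=0.1, epochs=300):
    """Plain full-batch logistic regression; y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    t = (y + 1) / 2                          # map targets to {0, 1}
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - t) / len(y)
    return w

def clean_accuracy(w):
    """Accuracy against the uncorrupted labels (a proxy for population accuracy)."""
    return float(np.mean(np.sign(X @ w) == y_clean))

# 1) Initial training on the given (noisy) labels.
w_init = fit_linear(X, y_noisy)

# 2) Predicted hard labels: the model's own 1/0 (here +1/-1) predictions.
y_pred = np.sign(X @ w_init)

# 3) Consensus-based retraining: keep only the samples where the predicted
#    label agrees with the given noisy label, then retrain on that subset.
keep = y_pred == y_noisy
w_retrain = fit_linear(X[keep], y_noisy[keep])
```

The consensus filter is the key step: samples where the model disagrees with its noisy label are disproportionately likely to be the flipped ones, so the retained subset is cleaner than the full noisy set, and no extra privacy budget is spent because only already-released noisy labels and the model's own predictions are used.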
Problem

Research questions and friction points this paper is trying to address.

Theoretical analysis of retraining with predicted hard labels
Improving model accuracy using noisy label retraining
Enhancing label differential privacy training via consensus-based retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retraining with predicted hard labels
Consensus-based retraining strategy
Improves label differential privacy training