Retraining with Predicted Hard Labels Provably Increases Model Accuracy

📅 2024-06-17
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses whether retraining a model on its own predicted hard labels can improve generalization accuracy when the given labels are randomly corrupted. The authors provide the first theoretical proof that, in a linearly separable binary classification setting, retraining with the hard labels predicted by the initially trained model can improve population accuracy over training once on the noisy labels. They further design a consensus-based retraining strategy that retains only samples for which the model's predicted label agrees with the given (noisy) label, incurring no extra privacy cost. Integrated into a label differential privacy (DP) training framework, this approach achieves over a 6% accuracy gain with ResNet-18 on CIFAR-100 under ε = 3 label DP, strengthening the privacy–utility trade-off under label noise.

📝 Abstract
The performance of a model trained with noisy labels is often improved by simply retraining the model with its own predicted hard labels (i.e., 1/0 labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable binary classification setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with local label differential privacy (DP) which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at no extra privacy cost; we call this consensus-based retraining. As an example, when training ResNet-18 on CIFAR-100 with ε = 3 label DP, we obtain more than 6% improvement in accuracy with consensus-based retraining.
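The retraining procedure the abstract describes can be sketched on synthetic data. Everything below is an illustrative assumption, not the paper's setup: the data generation, the plain logistic-regression classifier `fit_linear`, and all hyperparameters are stand-ins for the paper's linearly separable analysis setting (the experiments use ResNet-18 on CIFAR-100 with label DP noise).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linearly separable binary data with randomly flipped labels
# (an illustrative stand-in for the noisy labels "given to us").
n, d, flip_prob = 2000, 10, 0.2
w_star = rng.normal(size=d)                  # ground-truth separator
X = rng.normal(size=(n, d))
y_clean = np.sign(X @ w_star)                # labels in {-1, +1}
flip = rng.random(n) < flip_prob
y_noisy = np.where(flip, -y_clean, y_clean)  # labels the learner sees

def fit_linear(X, y, lr=0.1, epochs=300):
    """Plain full-batch logistic regression; y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    t = (y + 1) / 2                          # map targets to {0, 1}
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - t) / len(y)
    return w

def clean_accuracy(w):
    """Accuracy against the uncorrupted labels (a proxy for population accuracy)."""
    return float(np.mean(np.sign(X @ w) == y_clean))

# 1) Initial training on the given (noisy) labels.
w_init = fit_linear(X, y_noisy)

# 2) Predicted hard labels: the model's own 1/0 (here +1/-1) predictions.
y_pred = np.sign(X @ w_init)

# 3) Consensus-based retraining: keep only the samples where the predicted
#    label agrees with the given noisy label, then retrain on that subset.
keep = y_pred == y_noisy
w_retrain = fit_linear(X[keep], y_noisy[keep])
```

The consensus filter is the key step: samples where the model disagrees with its noisy label are disproportionately likely to be the flipped ones, so the retained subset is cleaner than the full noisy set, and no extra privacy budget is spent because only already-released noisy labels and the model's own predictions are used.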
Problem

Research questions and friction points this paper is trying to address.

Theoretical analysis of retraining with predicted hard labels
Improving model accuracy using noisy label retraining
Enhancing label differential privacy training via consensus-based retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retraining with predicted hard labels
Consensus-based retraining strategy
Improves label differential privacy training