DiffProb: Data Pruning for Face Recognition

📅 2025-05-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational cost, storage overhead, and privacy risks incurred by large-scale labeled data in face recognition, proposing the first data pruning method tailored to this task. The approach models intra-class redundancy via the consistency of predicted probability distributions across samples, identifying and removing informationally redundant instances; it further incorporates a lightweight label cleaning mechanism to jointly achieve data compression and quality enhancement. Key innovations include: (i) the first application of data pruning to face recognition; (ii) using intra-class prediction consistency as a principled metric for sample informativeness; and (iii) co-optimizing pruning and label correction. When pruning 50% of CASIA-WebFace, performance on LFW, CFP-FP, and IJB-C remains stable or improves; the method is compatible with mainstream architectures and loss functions; and it significantly reduces training cost and storage requirements.

📝 Abstract
Face recognition models have made substantial progress due to advances in deep learning and the availability of large-scale datasets. However, reliance on massive annotated datasets introduces challenges related to training computational cost and data storage, as well as potential privacy concerns regarding the management of large face datasets. This paper presents DiffProb, the first data pruning approach for the application of face recognition. DiffProb assesses the prediction probabilities of training samples within each identity and prunes those with identical or close prediction probability values, as they are likely reinforcing the same decision boundaries and thus contribute little new information. We further enhance this process with an auxiliary cleaning mechanism to eliminate mislabeled and label-flipped samples, boosting data quality with minimal loss. Extensive experiments on CASIA-WebFace with different pruning ratios and multiple benchmarks, including LFW, CFP-FP, and IJB-C, demonstrate that DiffProb can prune up to 50% of the dataset while maintaining or even, in some settings, improving verification accuracies. Additionally, we demonstrate DiffProb's robustness across different architectures and loss functions. Our method significantly reduces training cost and data volume, enabling efficient face recognition training and reducing the reliance on massive datasets and their demanding management.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational cost and storage in face recognition training
Addressing privacy concerns from large face datasets management
Maintaining accuracy while pruning redundant face recognition data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes data by prediction probability similarity
Uses auxiliary cleaning to remove mislabeled samples
Maintains accuracy while reducing dataset size significantly
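The core pruning idea above — within each identity, drop samples whose predicted probability for their true class is nearly identical to an already-kept sample — can be sketched as follows. This is an illustrative reconstruction, not the paper's exact algorithm: the threshold `eps` and the greedy sorted sweep are assumptions for demonstration.

```python
from collections import defaultdict

def diffprob_prune(sample_ids, labels, probs, eps=0.01):
    """Keep samples whose ground-truth prediction probability differs by at
    least `eps` from the previously kept sample of the same identity.

    probs[i] is the model's predicted probability for sample i's true class.
    NOTE: `eps` and the greedy sweep are hypothetical details, not taken
    from the paper.
    """
    by_identity = defaultdict(list)
    for sid, label, p in zip(sample_ids, labels, probs):
        by_identity[label].append((p, sid))

    kept = []
    for items in by_identity.values():
        items.sort()          # sweep samples of one identity by probability
        last_kept_p = None
        for p, sid in items:
            # Near-duplicate probabilities imply redundant information;
            # keep a sample only if it is sufficiently different.
            if last_kept_p is None or p - last_kept_p >= eps:
                kept.append(sid)
                last_kept_p = p
    return kept
```

For example, five same-identity samples with probabilities `[0.90, 0.905, 0.95, 0.951, 0.99]` collapse to three kept samples, since the 0.905 and 0.951 entries are within `eps` of an already-kept neighbor.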