TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective

📅 2023-08-20
🏛️ IEEE International Conference on Computer Vision
📈 Citations: 36
Influential: 5
🤖 AI Summary
Vision Transformers (ViTs), despite their strong representation ability, unexpectedly underperform when applied to face recognition (FR) with extremely large datasets. This work traces the problem to a data-level mismatch: existing data augmentation destroys face structural information, and existing hard sample mining ignores the information carried by each local token, so neither suits a ViT-based FR backbone. TransFace introduces two data-centric remedies: (1) DPAP, a patch-level augmentation that randomly perturbs the amplitude information of dominant patches to expand sample diversity and alleviate ViT overfitting; and (2) EHSM, a hard sample mining strategy that uses the information entropy of local tokens to dynamically reweight easy and hard samples during training, yielding more stable predictions. Experiments on several mainstream face benchmarks demonstrate the superiority of TransFace over prior methods.
📝 Abstract
Vision Transformers (ViTs) have demonstrated powerful representation ability in various visual tasks thanks to their intrinsic data-hungry nature. However, we unexpectedly find that ViTs perform vulnerably when applied to face recognition (FR) scenarios with extremely large datasets. We investigate the reasons for this phenomenon and discover that the existing data augmentation approach and hard sample mining strategy are incompatible with ViT-based FR backbones due to the lack of tailored consideration on preserving face structural information and leveraging each local token information. To remedy these problems, this paper proposes a superior FR model called TransFace, which employs a patch-level data augmentation strategy named DPAP and a hard sample mining strategy named EHSM. Specifically, DPAP randomly perturbs the amplitude information of dominant patches to expand sample diversity, which effectively alleviates the overfitting problem in ViTs. EHSM utilizes the information entropy in the local tokens to dynamically adjust the importance weight of easy and hard samples during training, leading to a more stable prediction. Experiments on several benchmarks demonstrate the superiority of our TransFace. Code and models are available at https://github.com/DanJun6737/TransFace.
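The core DPAP operation described above — perturbing a patch's amplitude spectrum while leaving its phase (which carries facial structure) intact — can be sketched minimally in NumPy. This is an illustrative simplification, not the paper's implementation: the function name `perturb_patch_amplitude`, the donor-patch mixing scheme, and the coefficient `lam` are assumptions; the actual method also selects dominant patches before perturbing them.

```python
import numpy as np

def perturb_patch_amplitude(patch, donor_patch, lam=0.5):
    """Mix the amplitude spectrum of `patch` with that of `donor_patch`,
    keeping the original phase spectrum untouched.

    Hypothetical sketch of DPAP's amplitude perturbation: phase encodes
    the face's spatial structure, so only amplitude is altered.
    """
    f = np.fft.fft2(patch)
    f_donor = np.fft.fft2(donor_patch)
    amp, phase = np.abs(f), np.angle(f)
    # Linear interpolation between the two amplitude spectra.
    mixed_amp = (1.0 - lam) * amp + lam * np.abs(f_donor)
    # Recombine mixed amplitude with the original phase and invert.
    mixed = mixed_amp * np.exp(1j * phase)
    return np.real(np.fft.ifft2(mixed))
```

With `lam=0` the patch is returned unchanged, which makes the perturbation strength easy to anneal or randomize per patch.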
Problem

Research questions and friction points this paper is trying to address.

ViTs overfit and perform vulnerably on extremely large-scale face recognition datasets
Existing data augmentation destroys face structural information, making it incompatible with ViT-based FR backbones
Existing hard sample mining ignores the information carried by each local token
Innovation

Methods, ideas, or system contributions that make the work stand out.

DPAP: patch-level augmentation that perturbs the amplitude information of dominant patches to expand sample diversity and curb ViT overfitting
EHSM: hard sample mining that uses the information entropy of local tokens to dynamically reweight easy and hard samples
Achieves more stable training and superior results on several mainstream face recognition benchmarks
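The entropy-based reweighting behind EHSM can be sketched as follows. This is a hypothetical simplification assuming per-token class logits as input; the paper's exact formulation of token information entropy and the mapping from entropy to loss weight may differ.

```python
import numpy as np

def token_entropy_weight(token_logits):
    """Score a sample by the mean information entropy of its local tokens.

    Illustrative sketch of EHSM's idea: tokens with near-uniform class
    distributions carry high entropy (hard sample -> larger weight),
    while confidently peaked tokens carry low entropy (easy sample).

    token_logits: array of shape (num_tokens, num_classes).
    """
    # Numerically stable softmax over the class dimension, per token.
    z = token_logits - token_logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Shannon entropy per token; small epsilon guards log(0).
    ent = -(p * np.log(p + 1e-12)).sum(axis=1)
    # Average token entropy as the sample-level importance score.
    return ent.mean()
```

In a training loop, such a score could scale each sample's loss term (after normalizing across the batch) so that hard, high-entropy samples contribute more to the gradient.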