🤖 AI Summary
Existing face representation pretraining methods face three key challenges: insufficient fine-grained semantic modeling, neglect of facial anatomical spatial structure, and low efficiency in leveraging scarce labeled data. This paper proposes an unsupervised face representation pretraining framework addressing these issues. Its core contributions are: (1) a semantic-aware structured masking strategy to enhance local discriminability; (2) a candidate codebook-driven block-level encoding scheme with patch-pixel alignment, explicitly enforcing facial geometric constraints; and (3) an end-to-end learnable codebook coupled with spatial consistency regularization. Trained solely on 2 million unlabeled face images, the method achieves state-of-the-art performance across diverse downstream tasks—including face recognition, pose estimation, and occlusion-robust analysis—particularly excelling under extreme pose variations, severe occlusions, and challenging illumination conditions.
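The semantic-aware structured masking idea can be illustrated with a small sketch. This is not the paper's algorithm: the 14×14 patch grid, the region bounding boxes (stand-ins for regions like eyes, nose, mouth), and the 40% target ratio are all assumptions chosen for illustration. The point is that whole semantically coherent regions are masked at once, rather than scattered individual patches.

```python
import numpy as np

rng = np.random.default_rng(0)

grid = 14                          # assumed 14x14 patch grid over the face image
mask = np.zeros((grid, grid), dtype=bool)

# Hypothetical semantic facial regions as patch-grid boxes
# (row_start, row_end, col_start, col_end): e.g. left eye, right eye, nose, mouth.
regions = [(3, 6, 2, 7), (3, 6, 7, 12), (6, 9, 5, 9), (9, 12, 4, 10)]

target_ratio = 0.4                 # assumed fraction of patches to mask
for r0, r1, c0, c1 in rng.permutation(regions):
    if mask.mean() >= target_ratio:
        break
    mask[r0:r1, c0:c1] = True      # mask the whole region as one coherent block

print(f"masked fraction: {mask.mean():.3f}")
```

Region-level masking forces the model to reconstruct a facial part from surrounding context, which is what the summary means by enhancing local discriminability while respecting spatial structure.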
📝 Abstract
Facial representation pre-training is crucial for tasks like facial recognition, expression analysis, and virtual reality. However, existing methods face three key challenges: (1) failing to capture distinct facial features and fine-grained semantics, (2) ignoring the spatial structure inherent to facial anatomy, and (3) inefficiently utilizing limited labeled data. To overcome these, we introduce PaCo-FR, an unsupervised framework that combines masked image modeling with patch-pixel alignment. Our approach integrates three innovative components: (1) a structured masking strategy that preserves spatial coherence by aligning with semantically meaningful facial regions, (2) a novel patch-based codebook that enhances feature discrimination with multiple candidate tokens, and (3) spatial consistency constraints that preserve geometric relationships between facial components. PaCo-FR achieves state-of-the-art performance across several facial analysis tasks with just 2 million unlabeled images for pre-training. Our method demonstrates significant improvements, particularly in scenarios with varying poses, occlusions, and lighting conditions. We believe this work advances facial representation learning and offers a scalable, efficient solution that reduces reliance on expensive annotated datasets, driving more effective facial analysis systems.
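The "patch-based codebook with multiple candidate tokens" can be sketched as a top-k variant of standard vector quantization: instead of assigning each patch embedding its single nearest codeword, keep the k nearest codewords as candidates. The shapes, codebook size, and k below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, dim = 196, 64         # e.g. a 14x14 patch grid (assumed)
codebook_size, k = 512, 4          # k candidate tokens per patch (assumed)

patches = rng.normal(size=(num_patches, dim))      # patch embeddings from the encoder
codebook = rng.normal(size=(codebook_size, dim))   # learnable codewords

# Squared Euclidean distance from every patch to every codeword.
d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)

# Standard VQ would take argmin; here each patch keeps its k nearest
# codewords as candidate tokens, giving the reconstruction target more slack.
candidates = np.argsort(d2, axis=1)[:, :k]         # shape (num_patches, k)

print(candidates.shape)
```

Keeping several candidates per patch is one plausible reading of how the codebook "enhances feature discrimination with multiple candidate tokens"; the paper's exact selection and training rule may differ.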