🤖 AI Summary
Face parsing under extreme poses is severely limited by the scarcity of annotated training data. To address this, we propose the first multi-view consistent label optimization framework based on 3D Gaussian Splatting (3DGS): without requiring ground-truth 3D annotations, it jointly fits models to RGB images and initial segmentation masks through a shared geometric representation, then renders high-fidelity, pose-consistent segmentation labels across multiple views; these synthetic labels are subsequently used to fine-tune the parsing model. Our method requires only a small set of initially annotated images yet generates high-fidelity, diverse training data. It significantly improves parsing accuracy under extreme poses while preserving performance on standard viewpoints, consistently outperforming state-of-the-art methods across quantitative metrics and human evaluation.
📝 Abstract
Accurate face parsing under extreme viewing angles remains a significant challenge due to limited labeled data in such poses. Manual annotation is costly and often impractical at scale. We propose a novel label refinement pipeline that leverages 3D Gaussian Splatting (3DGS) to generate accurate segmentation masks from noisy multi-view predictions. By jointly fitting two 3DGS models, one to RGB images and one to their initial segmentation maps, our method enforces multi-view consistency through shared geometry, enabling the synthesis of pose-diverse training data with only minimal post-processing. Fine-tuning a face parsing model on this refined dataset significantly improves accuracy on challenging head poses, while maintaining strong performance on standard views. Extensive experiments, including human evaluations, demonstrate that our approach achieves superior results compared to state-of-the-art methods, despite requiring no ground-truth 3D annotations and using only a small set of initial images. Our method offers a scalable and effective solution for improving face parsing robustness in real-world settings.
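The core idea of the pipeline, two renderers tied together by one shared geometry, can be illustrated with a deliberately minimal toy. The sketch below is not the paper's 3DGS implementation: it uses 1-D Gaussians on a line instead of 3-D splats, a scalar "appearance" signal in place of RGB, and a binary field in place of a segmentation mask. All names (`mu`, `sig`, `c`, `l`, `lam`, `lr`) are illustrative assumptions. The point it demonstrates is that the geometry parameters (`mu`) receive gradients from *both* the appearance loss and the label loss, so fitting the two signals jointly keeps their renders consistent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "scene": a smooth signal stands in for RGB, a step function for a mask.
x = np.linspace(0.0, 1.0, 200)
rgb_target = np.exp(-((x - 0.3) ** 2) / 0.005) + 0.5 * np.exp(-((x - 0.7) ** 2) / 0.01)
label_target = (x > 0.5).astype(float)

# Shared geometry: Gaussian positions used by BOTH renderers.
K = 8
mu = rng.uniform(0.0, 1.0, K)        # shared, optimized positions
sig = np.full(K, 0.08)               # fixed widths, for simplicity
c = rng.normal(0.0, 0.1, K)          # per-Gaussian appearance weights
l = rng.normal(0.0, 0.1, K)          # per-Gaussian label weights

def basis(mu):
    # Row i: Gaussian i evaluated on the whole grid.
    return np.exp(-((x[None, :] - mu[:, None]) ** 2) / (2 * sig[:, None] ** 2))

lam, lr = 1.0, 0.05                  # label-loss weight, step size (assumed values)
losses = []
for step in range(500):
    G = basis(mu)                    # (K, len(x))
    er = c @ G - rgb_target          # appearance residual
    es = l @ G - label_target        # label residual
    losses.append(np.mean(er ** 2) + lam * np.mean(es ** 2))
    # Analytic gradients; note grad_mu mixes BOTH residuals -> shared geometry.
    dG_dmu = G * (x[None, :] - mu[:, None]) / sig[:, None] ** 2
    grad_c = 2 * (G @ er) / len(x)
    grad_l = 2 * lam * (G @ es) / len(x)
    grad_mu = (2 * c * (dG_dmu @ er) + 2 * lam * l * (dG_dmu @ es)) / len(x)
    c -= lr * grad_c
    l -= lr * grad_l
    mu -= lr * grad_mu

print(f"joint loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Because `mu` is shared, the appearance fit regularizes where the label field can place its mass, which is the mechanism the abstract describes for suppressing noise in the initial per-view masks.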