🤖 AI Summary
To address inaccurate skin-region segmentation on 3D facial scans, which degrades registration accuracy, this paper proposes an end-to-end, mesh-level segmentation method that jointly leverages multi-view 2D semantic features and 3D geometric features. It uses a frozen Vision Transformer (ViT) backbone to extract robust 2D semantic features, then lifts and fuses them onto mesh vertices via learnable 3D feature projection and voxel-based aggregation, with subsequent refinement by a graph convolutional network (GCN). Notably, the method requires no manually annotated real scans: trained exclusively on synthetic data, it still generalizes well across domains. Evaluated on real-world 3D facial scans, it improves registration accuracy by 8.89% over pure 2D baselines and by 14.3% over pure 3D baselines, significantly enhancing facial registration quality.
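The core lifting step described above (project each mesh vertex into every camera view, sample the frozen backbone's feature map, and aggregate across views) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the pinhole projection matrices, and nearest-pixel sampling are all assumptions standing in for the learnable projection and voxel-based aggregation.

```python
import numpy as np

def lift_features_to_vertices(verts, feat_maps, cams):
    """Average per-view 2D features at each vertex's projection.

    verts:     (V, 3) mesh vertices in world space
    feat_maps: list of (H, W, C) per-view feature maps (e.g. from a frozen ViT)
    cams:      list of (3, 4) pinhole projection matrices (hypothetical setup)
    """
    V = verts.shape[0]
    C = feat_maps[0].shape[2]
    acc = np.zeros((V, C))                    # feature accumulator per vertex
    cnt = np.zeros((V, 1))                    # number of views seeing each vertex
    homo = np.hstack([verts, np.ones((V, 1))])  # homogeneous coords, (V, 4)
    for fmap, P in zip(feat_maps, cams):
        H, W, _ = fmap.shape
        proj = homo @ P.T                     # (V, 3) projected coords
        z = proj[:, 2:3]
        uv = proj[:, :2] / np.clip(z, 1e-8, None)   # pixel coordinates
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        # keep only vertices in front of the camera and inside the image
        visible = (z[:, 0] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        acc[visible] += fmap[v[visible], u[visible]]  # nearest-pixel sampling
        cnt[visible] += 1
    return acc / np.clip(cnt, 1, None)        # (V, C) lifted vertex features
```

Vertices visible in several views get an average of the sampled features, which is one simple way to resolve the multi-view inconsistency of per-image masks.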
📝 Abstract
Face registration deforms a template mesh to closely fit a 3D face scan; its quality commonly degrades in non-skin regions (e.g., hair, beard, accessories), because the optimized template-to-scan distance pulls the template mesh towards the noisy scan surface. Improving registration quality requires a clean separation of skin and non-skin regions on the scan mesh. Existing image-based (2D) and scan-based (3D) segmentation methods, however, perform poorly: image-based segmentation produces multi-view-inconsistent masks and cannot account for scan inaccuracies or scan-image misalignment, while scan-based methods suffer from lower spatial resolution than images. In this work, we introduce a novel method that accurately separates skin from non-skin geometry on 3D human head scans. For this, our method extracts features from multi-view images using a frozen image foundation model and aggregates these features in 3D. These lifted 2D features are then fused with 3D geometric features extracted from the scan mesh, to predict a segmentation mask directly on the scan mesh. We show that our segmentations improve registration accuracy over pure 2D and pure 3D segmentation methods by 8.89% and 14.3%, respectively. Although trained only on synthetic data, our model generalizes well to real data.
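The fusion and mesh-level prediction stage can likewise be sketched: concatenate the lifted 2D features with per-vertex geometric features, then smooth over the mesh graph before predicting a per-vertex skin probability. The mean-aggregation "graph convolution", the weight shapes, and the single-logit head below are hypothetical stand-ins for the paper's GCN, included only to make the data flow concrete.

```python
import numpy as np

def fuse_and_refine(feat2d, feat3d, adjacency, W_fuse, w_out, steps=2):
    """Fuse 2D and 3D per-vertex features, refine over the mesh graph,
    and output a per-vertex skin probability.

    feat2d:    (V, C2) features lifted from multi-view images
    feat3d:    (V, C3) geometric features (e.g. vertex normals)
    adjacency: per-vertex list of neighbor vertex indices on the mesh
    W_fuse:    (C2 + C3, D) fusion weights (hypothetical)
    w_out:     (D,) output weights producing one logit per vertex
    """
    x = np.hstack([feat2d, feat3d]) @ W_fuse      # (V, D) fused features
    x = np.maximum(x, 0.0)                        # ReLU
    for _ in range(steps):
        # mean over each vertex and its mesh neighbors (simple GCN layer)
        x = np.stack([x[[i] + list(nbrs)].mean(axis=0)
                      for i, nbrs in enumerate(adjacency)])
    logits = x @ w_out                            # (V,) per-vertex logits
    return 1.0 / (1.0 + np.exp(-logits))          # sigmoid -> skin probability
```

Predicting directly on mesh vertices (rather than per image) is what makes the final mask multi-view consistent by construction.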