🤖 AI Summary
To address inaccurate skin-region segmentation on 3D facial scans, which degrades registration accuracy, this paper proposes an end-to-end, mesh-level segmentation method that jointly leverages multi-view 2D semantic features and 3D geometric features. It uses a frozen Vision Transformer (ViT) backbone to extract robust 2D semantic features, then lifts and fuses them onto mesh vertices via learnable 3D feature projection and voxel-based aggregation, with subsequent refinement by a graph convolutional network (GCN). Notably, the method requires no manually annotated real scans: trained exclusively on synthetic data, it still generalizes well across domains. Evaluated on real-world 3D facial scans, it improves registration accuracy by 8.89% over pure 2D baselines and by 14.3% over pure 3D baselines, significantly enhancing facial registration quality.
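The core lifting step described above (project each mesh vertex into every camera view, sample the frozen backbone's feature map, and aggregate across views) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the pinhole projection matrices, and nearest-pixel sampling are all assumptions standing in for the learnable projection and voxel-based aggregation.

```python
import numpy as np

def lift_features_to_vertices(verts, feat_maps, cams):
    """Average per-view 2D features at each vertex's projection.

    verts:     (V, 3) mesh vertices in world space
    feat_maps: list of (H, W, C) per-view feature maps (e.g. from a frozen ViT)
    cams:      list of (3, 4) pinhole projection matrices (hypothetical setup)
    """
    V = verts.shape[0]
    C = feat_maps[0].shape[2]
    acc = np.zeros((V, C))                    # feature accumulator per vertex
    cnt = np.zeros((V, 1))                    # number of views seeing each vertex
    homo = np.hstack([verts, np.ones((V, 1))])  # homogeneous coords, (V, 4)
    for fmap, P in zip(feat_maps, cams):
        H, W, _ = fmap.shape
        proj = homo @ P.T                     # (V, 3) projected coords
        z = proj[:, 2:3]
        uv = proj[:, :2] / np.clip(z, 1e-8, None)   # pixel coordinates
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        # keep only vertices in front of the camera and inside the image
        visible = (z[:, 0] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        acc[visible] += fmap[v[visible], u[visible]]  # nearest-pixel sampling
        cnt[visible] += 1
    return acc / np.clip(cnt, 1, None)        # (V, C) lifted vertex features
```

Vertices visible in several views get an average of the sampled features, which is one simple way to resolve the multi-view inconsistency of per-image masks.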
📝 Abstract
Face registration deforms a template mesh to closely fit a 3D face scan; its quality commonly degrades in non-skin regions (e.g., hair, beard, accessories), because the optimized template-to-scan distance pulls the template mesh towards the noisy scan surface. Improving registration quality requires a clean separation of skin and non-skin regions on the scan mesh. Existing image-based (2D) and scan-based (3D) segmentation methods, however, perform poorly: image-based segmentation produces multi-view-inconsistent masks and cannot account for scan inaccuracies or scan-image misalignment, while scan-based methods suffer from lower spatial resolution than images. In this work, we introduce a novel method that accurately separates skin from non-skin geometry on 3D human head scans. For this, our method extracts features from multi-view images using a frozen image foundation model and aggregates these features in 3D. These lifted 2D features are then fused with 3D geometric features extracted from the scan mesh, to predict a segmentation mask directly on the scan mesh. We show that our segmentations improve registration accuracy over pure 2D and pure 3D segmentation methods by 8.89% and 14.3%, respectively. Although trained only on synthetic data, our model generalizes well to real data.
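The fusion and mesh-level prediction stage can likewise be sketched: concatenate the lifted 2D features with per-vertex geometric features, then smooth over the mesh graph before predicting a per-vertex skin probability. The mean-aggregation "graph convolution", the weight shapes, and the single-logit head below are hypothetical stand-ins for the paper's GCN, included only to make the data flow concrete.

```python
import numpy as np

def fuse_and_refine(feat2d, feat3d, adjacency, W_fuse, w_out, steps=2):
    """Fuse 2D and 3D per-vertex features, refine over the mesh graph,
    and output a per-vertex skin probability.

    feat2d:    (V, C2) features lifted from multi-view images
    feat3d:    (V, C3) geometric features (e.g. vertex normals)
    adjacency: per-vertex list of neighbor vertex indices on the mesh
    W_fuse:    (C2 + C3, D) fusion weights (hypothetical)
    w_out:     (D,) output weights producing one logit per vertex
    """
    x = np.hstack([feat2d, feat3d]) @ W_fuse      # (V, D) fused features
    x = np.maximum(x, 0.0)                        # ReLU
    for _ in range(steps):
        # mean over each vertex and its mesh neighbors (simple GCN layer)
        x = np.stack([x[[i] + list(nbrs)].mean(axis=0)
                      for i, nbrs in enumerate(adjacency)])
    logits = x @ w_out                            # (V,) per-vertex logits
    return 1.0 / (1.0 + np.exp(-logits))          # sigmoid -> skin probability
```

Predicting directly on mesh vertices (rather than per image) is what makes the final mask multi-view consistent by construction.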