Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction

📅 2025-05-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses high-fidelity 3D face reconstruction from a single RGB image. The authors propose Pixel3DMM, vision transformers built on DINO features that regress surface normals and UV coordinates pixel-wise, which in turn constrain a differentiable optimization of FLAME parameters. The method introduces screen-space geometric priors and a large-scale benchmark for single-image reconstruction covering diverse expressions, poses, and ethnicities (976K images across 1,000+ identities), the first benchmark to evaluate geometric accuracy for both neutral and posed expressions. On this benchmark, Pixel3DMM reduces geometric error for posed facial expressions by over 15% relative to the strongest baseline.

📝 Abstract
We address the 3D reconstruction of human faces from a single RGB image. To this end, we propose Pixel3DMM, a set of highly generalized vision transformers which predict per-pixel geometric cues in order to constrain the optimization of a 3D morphable face model (3DMM). We exploit the latent features of the DINO foundation model, and introduce a tailored surface normal and uv-coordinate prediction head. We train our model by registering three high-quality 3D face datasets against the FLAME mesh topology, which results in a total of over 1,000 identities and 976K images. For 3D face reconstruction, we propose a FLAME fitting optimization that solves for the 3DMM parameters from the uv-coordinate and normal estimates. To evaluate our method, we introduce a new benchmark for single-image face reconstruction, which features high diversity in facial expressions, viewing angles, and ethnicities. Crucially, our benchmark is the first to evaluate both posed and neutral facial geometry. Ultimately, our method outperforms the most competitive baselines by over 15% in terms of geometric accuracy for posed facial expressions.
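As a rough illustration of the per-pixel prediction heads the abstract describes, the following numpy sketch stands in for the DINO-feature backbone with random features and random head weights. All shapes, names, and the sigmoid/normalization choices here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for dense per-pixel backbone features over the image plane
# (H, W, D are made-up; the real model uses a ViT with DINO latents)
H, W, D = 32, 32, 64
feats = rng.normal(size=(H, W, D))

# Two lightweight per-pixel regression heads (random weights here;
# in the actual method these heads are trained)
w_normal = rng.normal(size=(D, 3)) / np.sqrt(D)
w_uv = rng.normal(size=(D, 2)) / np.sqrt(D)

# Surface normals: regress 3 channels per pixel, normalize to unit length
raw_n = feats @ w_normal                                   # (H, W, 3)
normals = raw_n / np.linalg.norm(raw_n, axis=-1, keepdims=True)

# UV coordinates: regress 2 channels per pixel, squash into [0, 1]
uv = 1.0 / (1.0 + np.exp(-(feats @ w_uv)))                 # (H, W, 2)

print(normals.shape, uv.shape)
```

The key point is that both outputs live in screen space, one value per pixel, which is what lets them act as dense constraints on the subsequent 3DMM fit.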
Problem

Research questions and friction points this paper is trying to address.

Reconstruct 3D faces from single RGB images
Predict per-pixel geometric cues for 3DMM optimization
Improve geometric accuracy for posed facial expressions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vision transformers to predict per-pixel geometric cues
Leverages DINO features for surface normal and uv-coordinate prediction
Optimizes FLAME parameters against the uv-coordinate and normal estimates
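The fitting stage above can be sketched with a toy linear morphable model standing in for FLAME: per-pixel uv predictions reduce to correspondences between pixels and mesh points, and the model parameters are recovered by gradient descent on the resulting residuals. Everything below (the shapes, the linear model, the loss) is an illustrative assumption, not the paper's actual FLAME energy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear morphable model: vertices(p) = mean + basis @ p
# (a stand-in for FLAME; sizes are made-up for illustration)
n_verts, n_params = 50, 5
mean = rng.normal(size=(n_verts, 3))
basis = rng.normal(size=(n_verts, 3, n_params)) * 0.1

def vertices(p):
    return mean + basis @ p                   # (n_verts, 3)

# Synthetic "screen-space" observations: uv predictions reduce, per
# pixel, to a correspondence with a mesh point; `corr` plays that role.
p_true = rng.normal(size=n_params)
corr = rng.integers(0, n_verts, size=200)     # per-pixel vertex index
targets = vertices(p_true)[corr]              # observed 3D positions

# Fit the parameters by gradient descent on the correspondence loss
# L(p) = mean over pixels of ||vertices(p)[corr] - targets||^2
p = np.zeros(n_params)
lr = 1.0
for _ in range(3000):
    resid = vertices(p)[corr] - targets                        # (n_pix, 3)
    grad = np.einsum('ij,ijk->k', resid, basis[corr]) * 2 / len(corr)
    p -= lr * grad

print(np.abs(p - p_true).max())
```

The real optimization adds a rendering step and a normal-agreement term, but the structure is the same: dense screen-space predictions define a differentiable residual over 3DMM parameters.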