3D Facial Expressions through Analysis-by-Neural-Synthesis

📅 2024-04-05
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 16
Influential: 3
🤖 AI Summary
Existing 3D face reconstruction methods recover coarse geometry from in-the-wild images but struggle to accurately model fine-grained, extreme, asymmetric, or rare facial expressions. To address this, the authors propose SMIRK, a framework that introduces neural rendering into self-supervised 3D expression reconstruction, replacing conventional differentiable rendering so that the optimization of geometry is decoupled from appearance. They design an identity-consistent expression augmentation mechanism to synthesize diverse training data, substantially improving expression modeling. Additionally, they incorporate sparse pixel sampling and a geometry-focused reconstruction loss to enhance detail fidelity. Extensive qualitative, quantitative, and perceptual evaluations demonstrate state-of-the-art performance, with particularly significant improvements in reconstructing extreme and asymmetric expressions.

📝 Abstract
While existing methods for 3D face reconstruction from in-the-wild images excel at recovering the overall face shape, they commonly miss subtle, extreme, asymmetric, or rarely observed expressions. We improve upon these methods with SMIRK (Spatial Modeling for Image-based Reconstruction of Kinesics), which faithfully reconstructs expressive 3D faces from images. We identify two key limitations in existing methods: shortcomings in their self-supervised training formulation, and a lack of expression diversity in the training images. For training, most methods employ differentiable rendering to compare a predicted face mesh with the input image, along with a plethora of additional loss functions. This differentiable rendering loss not only has to provide supervision to optimize for 3D face geometry, camera, albedo, and lighting, which is an ill-posed optimization problem, but the domain gap between rendering and input image further hinders the learning process. Instead, SMIRK replaces the differentiable rendering with a neural rendering module that, given the rendered predicted mesh geometry, and sparsely sampled pixels of the input image, generates a face image. As the neural rendering gets color information from sampled image pixels, supervising with neural rendering-based reconstruction loss can focus solely on the geometry. Further, it enables us to generate images of the input identity with varying expressions while training. These are then utilized as input to the reconstruction model and used as supervision with ground truth geometry. This effectively augments the training data and enhances the generalization for diverse expressions. Our qualitative, quantitative and particularly our perceptual evaluations demonstrate that SMIRK achieves the new state-of-the-art performance on accurate expression reconstruction. For our method's source code, demo video and more, please visit our project webpage: https://georgeretsi.github.io/smirk/.
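The abstract's key mechanism, feeding the neural renderer only sparsely sampled pixels of the input image so that color hints carry almost no spatial structure, can be sketched as follows. This is a minimal illustrative sketch, not the authors' code; the function name, the sampling ratio, and the masking scheme are all assumptions:

```python
import numpy as np

def sparse_pixel_sample(image, ratio=0.05, seed=None):
    """Hypothetical sketch of sparse pixel sampling.

    Randomly keeps a small fraction of the input pixels and zeroes the
    rest. A neural renderer conditioned on this sparse image receives
    colour information but almost no spatial structure, so the
    image-reconstruction loss must be explained by the predicted mesh
    geometry rather than by copied appearance.
    """
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape
    # Binary mask: each pixel survives independently with probability `ratio`.
    mask = (rng.random((h, w)) < ratio).astype(image.dtype)
    # Broadcast the mask over the colour channels.
    return image * mask[..., None], mask
```

In this reading, the sparse image and the rasterized predicted mesh would be concatenated as inputs to the neural renderer, whose output is compared against the full input image.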
Problem

Research questions and friction points this paper is trying to address.

Existing methods recover overall face shape but miss subtle, extreme, asymmetric, or rarely observed expressions.
Self-supervised training with differentiable rendering must jointly solve for geometry, camera, albedo, and lighting, an ill-posed problem worsened by the domain gap between renderings and real images.
Training images lack expression diversity, limiting generalization to diverse expressions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

A neural rendering module replaces differentiable rendering: given the rasterized predicted mesh and sparsely sampled input pixels, it generates a face image.
Since color comes from the sampled pixels, the reconstruction loss supervises geometry alone.
The neural renderer synthesizes images of the input identity with varied expressions, augmenting training data with known-geometry supervision.
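The third point, generating same-identity images with swapped expressions that then serve as supervision with known geometry, can be sketched roughly as below. This is an illustrative assumption about the augmentation step; the function name, the idea of an "expression bank," and the parameter shapes are hypothetical, not taken from the paper's code:

```python
import numpy as np

def swap_expressions(expr_params, expr_bank, seed=None):
    """Hypothetical identity-consistent expression swap.

    For each sample in a batch, replaces its predicted expression vector
    with one drawn from a bank of diverse expressions. A neural renderer
    conditioned on geometry could then synthesise an image of the same
    identity wearing the new expression, and the swapped-in vector acts
    as ground-truth geometry supervision for the reconstruction model.
    """
    rng = np.random.default_rng(seed)
    # One random bank index per batch element.
    idx = rng.integers(0, expr_bank.shape[0], size=expr_params.shape[0])
    return expr_bank[idx]
```

The appeal of this scheme is that the augmented images come paired with exact expression parameters, giving direct supervision that in-the-wild images never provide.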