Face Anything: 4D Face Reconstruction from Any Image Sequence

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Dynamic faces are ambiguous to reconstruct and to put into correspondence because expressions, non-rigid deformations, and viewpoint changes occur together. This work proposes a unified Transformer-based feed-forward model that, for the first time, introduces canonical facial coordinates as a shared representation, reframing dynamic face reconstruction and dense tracking as a static reconstruction problem in canonical space. The model jointly predicts depth maps and canonical facial coordinates and is trained with multi-view geometric supervision to produce temporally consistent, high-fidelity 4D reconstructions. It reports state-of-the-art performance across multiple image and video benchmarks, with approximately 3× lower correspondence error, 16% better depth accuracy, and faster inference than prior dynamic reconstruction methods.

📝 Abstract

Accurate reconstruction and tracking of dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation. We present a unified method for high-fidelity 4D facial reconstruction based on canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation transforms dense tracking and dynamic reconstruction into a canonical reconstruction problem, enabling temporally consistent geometry and reliable correspondences within a single feed-forward model. By jointly predicting depth and canonical coordinates, our method enables accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking within a single architecture. We implement this formulation using a transformer-based model trained on multi-view geometry data that is non-rigidly warped into the canonical space. Extensive experiments on image and video benchmarks demonstrate state-of-the-art performance across reconstruction and tracking tasks, achieving approximately 3× lower correspondence error and faster inference than prior dynamic reconstruction methods, while improving depth accuracy by 16%. These results highlight canonical facial point prediction as an effective foundation for unified feed-forward 4D facial reconstruction.
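
To make the canonical-coordinate idea concrete, here is a minimal sketch of how per-pixel canonical coordinates could turn dense tracking into a lookup: pixels in two frames are matched by nearest neighbor in canonical space, and each match is lifted to 3D using the predicted depth. This is an illustration under stated assumptions, not the authors' implementation. The function names (`match_by_canonical`, `unproject`), the dictionary-based data layout, the brute-force search, and the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) are all hypothetical and do not come from the paper.

```python
from math import inf


def match_by_canonical(canon_a, canon_b):
    """Match pixels across two frames by nearest canonical coordinate.

    canon_a, canon_b: dict mapping a pixel (u, v) to its canonical
    coordinate (x, y, z). In the paper's setting a network would predict
    these; here they are plain inputs (assumption for illustration).
    Returns a dict mapping each pixel of frame A to
    (best pixel in frame B, squared canonical distance).
    """
    matches = {}
    for pix_a, coord_a in canon_a.items():
        best_pix, best_d2 = None, inf
        # Brute-force search; a real system would use a spatial index.
        for pix_b, coord_b in canon_b.items():
            d2 = sum((p - q) ** 2 for p, q in zip(coord_a, coord_b))
            if d2 < best_d2:
                best_pix, best_d2 = pix_b, d2
        matches[pix_a] = (best_pix, best_d2)
    return matches


def unproject(pixel, depth, fx, fy, cx, cy):
    """Lift a pixel with known depth to a camera-space 3D point.

    Uses a standard pinhole camera model; the intrinsics are assumed
    to be known and are not taken from the paper.
    """
    u, v = pixel
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


# Toy example: two facial points seen in frame A, three pixels in frame B.
frame_a = {(10, 10): (0.0, 0.0, 0.0), (20, 10): (0.5, 0.0, 0.1)}
frame_b = {
    (12, 11): (0.01, 0.0, 0.0),
    (23, 12): (0.49, 0.01, 0.1),
    (30, 30): (0.9, 0.9, 0.2),
}
matches = match_by_canonical(frame_a, frame_b)
for pix_a, (pix_b, d2) in matches.items():
    print(pix_a, "->", pix_b, "squared canonical distance:", d2)
```

Because the canonical coordinate of a facial point is meant to stay fixed under expression and viewpoint changes, the same lookup would work between any pair of frames without tracking through the intermediate ones. Thresholding the squared distance is one simple way to reject matches for points that are occluded in the second frame.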

Problem

Research questions and friction points this paper is trying to address.

4D face reconstruction

non-rigid deformation

expression changes

viewpoint variations

correspondence estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

canonical facial point prediction

4D face reconstruction

transformer-based model