FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision

📅 2025-12-17
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address geometric incompleteness and poor cross-view generalization in single-image 3D head avatar reconstruction, the paper proposes the first unified framework that supports joint training on monocular and multi-view data. Its core innovation is a set of learnable "bias sink" data-source tokens within a Transformer architecture that adaptively fuse heterogeneous supervision, including monocular depth/normal constraints and multi-view SfM point clouds. The authors further design an implicit, disentangled 3D head representation that separately models geometry, appearance, and pose. The method enables identity interpolation, flexible fitting to an arbitrary number of observations, and monocular-driven animation, and it significantly outperforms prior approaches on single-view reconstruction, few-shot generalization, and facial animation. Notably, it achieves, for the first time under purely monocular supervision, geometrically complete, photorealistically textured, and cross-view-consistent 3D head reconstruction.
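
The bias-sink idea can be pictured concretely. Below is a minimal PyTorch sketch of the general mechanism of prepending a learnable per-data-source token to a Transformer's input sequence; the class name, dimensions, and layer choices are assumptions for illustration, not FlexAvatar's actual implementation.

```python
import torch
import torch.nn as nn

class SourceTokenTransformer(nn.Module):
    """Transformer encoder with learnable per-data-source "bias sink" tokens.

    Hypothetical sketch: one token per training data source (e.g. 0 =
    monocular video, 1 = multi-view capture) is prepended to the input so
    that source-specific bias can be absorbed there instead of entangling
    the driving signal with the target viewpoint.
    """

    def __init__(self, dim: int = 512, num_sources: int = 2,
                 depth: int = 8, heads: int = 8):
        super().__init__()
        # One learnable token per data source.
        self.source_tokens = nn.Parameter(0.02 * torch.randn(num_sources, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens: torch.Tensor, source_id: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) content tokens; source_id: (B,) dataset ids.
        sink = self.source_tokens[source_id].unsqueeze(1)  # (B, 1, dim)
        x = torch.cat([sink, tokens], dim=1)               # (B, 1+N, dim)
        return self.encoder(x)[:, 1:]                      # drop the sink token
```
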

📝 Abstract
We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations. In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation, while FlexAvatar generates complete 3D head avatars with realistic facial animations. Website: https://tobias-kirschstein.github.io/flexavatar/
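
As a concrete illustration of what a smooth latent avatar space affords, here is a minimal, self-contained sketch of identity interpolation between two latent codes. Spherical interpolation (slerp) is a generic technique and the function below is an assumption for illustration, not FlexAvatar's actual operator; plain linear interpolation is the simplest alternative in a sufficiently smooth space.

```python
import torch

def slerp(z0: torch.Tensor, z1: torch.Tensor, t: float,
          eps: float = 1e-7) -> torch.Tensor:
    """Spherical interpolation between two identity latents z0, z1.

    In a smooth latent avatar space, intermediate codes decode to
    plausible in-between identities. Hypothetical helper for illustration.
    """
    a = z0 / (z0.norm() + eps)
    b = z1 / (z1.norm() + eps)
    omega = torch.acos(torch.clamp((a * b).sum(), -1 + eps, 1 - eps))
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) * z0 + torch.sin(t * omega) * z1) / so

# Usage: sweep t over [0, 1] to morph between two avatars' identity codes.
# z_mid = slerp(z_alice, z_bob, 0.5)
```
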
Problem

Research questions and friction points this paper is trying to address.

Creating complete 3D head avatars from single images
Overcoming incomplete reconstructions from monocular training
Unifying monocular and multi-view data for full 3D completeness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based model with bias sinks
Unified training on monocular and multi-view data
Smooth latent space for identity interpolation and flexible fitting (see the sketch below)
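
The "flexible fitting to an arbitrary number of observations" can be pictured as test-time optimization of an identity latent against however many images are available. Below is a minimal sketch of that pattern; `decoder`, the loss choice, and all hyperparameters are assumptions for illustration, not the paper's procedure.

```python
import torch
import torch.nn.functional as F

def fit_latent(decoder, observations, z_dim=256, steps=500, lr=1e-2):
    """Fit one identity latent to K >= 1 (image, camera) observations.

    `decoder(z, camera) -> image` is a hypothetical frozen renderer
    standing in for the avatar model; only the latent z is optimized.
    """
    z = torch.zeros(z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Sum a reconstruction loss over all available views; one image
        # or many, the same code path applies.
        loss = sum(F.l1_loss(decoder(z, cam), img) for img, cam in observations)
        loss.backward()
        opt.step()
    return z.detach()
```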