🤖 AI Summary
This work addresses a limitation of existing high-fidelity 3D Gaussian head avatar generation methods: they typically rely on multi-view images, 3D scans, or intermediate 2D view synthesis, which constrains data acquisition and practical deployment. To overcome this, we propose MVCHead, a framework that learns both conditional and unconditional 3D head models directly from randomly sampled 2D images, without multi-view supervision, 3D annotations, or intermediate view synthesis, and generates multi-view-consistent 3D Gaussian head avatars in a single forward pass. Key components include the Hierarchical State Space (HiSS) block, which refines Gaussians from coarse to fine while capturing long-range dependencies; the Hierarchical Bi-directional State Scan (HiBiSS), which modifies Mamba's unidirectional scan so that the recurrence is aligned with the axes along which multi-view inconsistencies are strongest; and an SE(3) Multi-view Critic, which enforces 3D consistency by judging whether a set of self-rendered images arises from a single underlying 3D configuration. Our method achieves state-of-the-art perceptual quality, surpasses prior methods in texture and geometric consistency, and maintains comparable shape consistency. We also release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets.
📝 Abstract
High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without using multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that enforces multi-view consistency (MVC) directly in the 3D representation while regressing 3D Gaussians under these constraints. At its core, we propose a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine, while capturing long-range dependencies. Within each HiSS block, we modify Mamba's standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS) that aligns recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality, surpasses prior methods in both texture and geometric consistency, and maintains comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluation of 3D head models. Project Page and code: https://humansensinglab.github.io/MVCHead/
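To make the idea of a bi-directional state scan concrete, the following is a minimal sketch and not the authors' HiBiSS implementation. It assumes a scalar linear recurrence `h_t = a * h_{t-1} + b * x_t` as a stand-in for a state space layer, runs it forward and backward along a caller-chosen axis, and averages the two passes. The function names, the parameters `a` and `b`, and the averaging step are illustrative assumptions. The abstract does not specify how scan axes are selected or how the two directions are combined.

```python
import numpy as np

def linear_scan(x, a, b):
    """Run h_t = a * h_{t-1} + b * x_t along axis 0 and return every state."""
    h = np.zeros_like(x[0], dtype=float)
    out = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

def bidirectional_scan(x, axis, a=0.9, b=0.1):
    """Scan forward and backward along `axis`, then average the two passes.

    Hypothetical helper: `a`, `b`, and the averaging rule are assumptions
    chosen for illustration, not values taken from the paper.
    """
    x = np.moveaxis(np.asarray(x, dtype=float), axis, 0)
    fwd = linear_scan(x, a, b)
    # Reverse the sequence, scan it, then restore the original order.
    bwd = linear_scan(x[::-1], a, b)[::-1]
    return np.moveaxis(0.5 * (fwd + bwd), 0, axis)

# Usage: a toy feature grid with 5 positions and 3 channels.
features = np.zeros((5, 3))
features[0] = 1.0
mixed = bidirectional_scan(features, axis=0)
```

With a unidirectional scan, position `t` only sees inputs at positions up to `t`. Adding the reverse pass lets every position aggregate context from both sides of the chosen axis, which is the general motivation for bi-directional scanning in state space models.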