🤖 AI Summary
Monocular video-based reconstruction of animatable 3D Gaussian head avatars suffers from severe geometric ambiguity and novel-view artifacts, particularly in regions the camera never observes. This paper addresses the problem with a Gaussian splatting framework guided by multi-view diffusion priors. The key contributions are: (1) the first use of a multi-view facial diffusion model to supervise Gaussian optimization, mitigating single-view geometric ambiguity; (2) FLAME-derived normal maps as pixel-aligned conditioning for precise viewpoint control; and (3) a distillation scheme that uses iteratively denoised images as pseudo-ground truths, suppressing over-saturation and structural distortion. Combined with VAE feature conditioning and latent-space upsampling, the method improves geometric fidelity and rendering consistency, achieving a 5.34% SSIM gain in novel-view synthesis on the NeRSemble dataset and state-of-the-art photorealism and fine-grained detail from consumer-grade monocular input.
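The conditioning in contribution (2) can be sketched as follows. This is an illustrative toy, not the authors' code: all function names, shapes, and the naive downsampling are assumptions. The idea is simply that a FLAME normal-map rendering (pixel-aligned view control) and VAE features of the input frame (identity and appearance) are concatenated channel-wise with the noisy latent fed to the diffusion model.

```python
import numpy as np

def encode_vae(image):
    # Stand-in for a VAE encoder: 8x spatial downsampling to 4 channels.
    # A real encoder would be learned; here we just broadcast the mean.
    h, w, _ = image.shape
    return np.zeros((h // 8, w // 8, 4)) + image.mean()

def condition_diffusion_input(noisy_latent, flame_normals, input_frame):
    # Normal maps rendered from the FLAME fit are brought to latent
    # resolution (naive strided downsample here) and concatenated with
    # the noisy latent; VAE features carry identity/appearance cues.
    normals_lr = flame_normals[::8, ::8, :]
    id_feats = encode_vae(input_frame)
    return np.concatenate([noisy_latent, normals_lr, id_feats], axis=-1)

latent = np.zeros((32, 32, 4))            # noisy latent at 32x32
normals = np.ones((256, 256, 3))          # FLAME normal rendering
frame = np.full((256, 256, 3), 0.5)       # monocular input frame
x = condition_diffusion_input(latent, normals, frame)
print(x.shape)  # (32, 32, 11): 4 latent + 3 normal + 4 VAE channels
```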
📝 Abstract
We propose a novel approach for reconstructing animatable 3D Gaussian avatars from monocular videos captured by commodity devices such as smartphones. Photorealistic 3D head avatar reconstruction from such recordings is challenging: the limited observations leave unobserved regions under-constrained and can lead to artifacts in novel views. To address this problem, we introduce a multi-view head diffusion model and leverage its priors to fill in missing regions and ensure view consistency in Gaussian splatting renderings. To enable precise viewpoint control, we use normal maps rendered from a FLAME-based head reconstruction, which provide pixel-aligned inductive biases. We also condition the diffusion model on VAE features extracted from the input image to preserve details of facial identity and appearance. For Gaussian avatar reconstruction, we distill the multi-view diffusion priors by using iteratively denoised images as pseudo-ground truths, effectively mitigating over-saturation issues. To further improve photorealism, we apply latent upsampling to refine the denoised latent before decoding it into an image. On the NeRSemble dataset, GAF outperforms previous state-of-the-art methods in novel-view synthesis with a 5.34% higher SSIM score. Furthermore, we demonstrate higher-fidelity avatar reconstructions from monocular videos captured on commodity devices.
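The distillation step described above (iteratively denoised images as pseudo-ground truths, rather than score-distillation gradients) can be sketched with a toy optimization loop. Everything here is an assumption for illustration: the "renderer" is a linear map standing in for Gaussian splatting, and `denoise` is a mock diffusion model whose iterative steps pull a noisy rendering toward its conditioning signal (in the paper, FLAME normals plus VAE features).

```python
import numpy as np

rng = np.random.default_rng(0)

def render_view(params, view):
    # Stand-in for rendering the Gaussian avatar from one viewpoint.
    return view @ params

def denoise(noisy, cond, steps=10):
    # Mock multi-view diffusion model: each iterative denoising step
    # moves the noisy rendering toward the conditioning signal.
    x = noisy.copy()
    for _ in range(steps):
        x = x + 0.3 * (cond - x)
    return x

# Toy setup: 4 viewpoints, avatar parameters, and a target appearance.
views = [rng.standard_normal((8, 5)) for _ in range(4)]
true_params = rng.standard_normal((5, 3))
params = np.zeros((5, 3))

losses = []
for it in range(200):
    grad = np.zeros_like(params)
    loss = 0.0
    for V in views:
        rendering = render_view(params, V)
        noisy = rendering + 0.1 * rng.standard_normal(rendering.shape)
        # Iteratively denoised image serves as the pseudo-ground truth.
        pseudo_gt = denoise(noisy, cond=V @ true_params)
        diff = rendering - pseudo_gt
        # Plain photometric loss against the pseudo-GT, which avoids the
        # over-saturated gradients of direct score distillation.
        loss += float((diff ** 2).mean())
        grad += 2.0 * V.T @ diff / diff.size
    params -= 0.1 * grad
    losses.append(loss)
```

Running the loop drives the multi-view photometric loss toward zero, which is the mechanism the abstract describes: the diffusion prior supplies plausible targets for views the monocular video never observed.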