🤖 AI Summary
Monocular video-based reconstruction of animatable 3D Gaussian head avatars suffers from severe geometric ambiguity and novel-view artifacts, particularly in regions the camera never observes. This paper addresses the problem with a Gaussian splatting framework guided by multi-view diffusion priors. The key contributions are: (1) the first use of a multi-view facial diffusion model to supervise Gaussian optimization, mitigating single-view geometric ambiguity; (2) FLAME-derived normal maps as pixel-aligned conditioning for precise viewpoint control; and (3) a distillation scheme that uses iteratively denoised images as pseudo-ground truths, suppressing over-saturation and structural distortion. Combined with VAE feature conditioning and latent-space upsampling, the method improves geometric fidelity and rendering consistency, achieving a 5.34% SSIM gain in novel-view synthesis on the NeRSemble dataset and state-of-the-art photorealism and fine-grained detail from consumer-grade monocular input.
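The conditioning in contribution (2) can be sketched as follows. This is an illustrative toy, not the authors' code: all function names, shapes, and the naive downsampling are assumptions. The idea is simply that a FLAME normal-map rendering (pixel-aligned view control) and VAE features of the input frame (identity and appearance) are concatenated channel-wise with the noisy latent fed to the diffusion model.

```python
import numpy as np

def encode_vae(image):
    # Stand-in for a VAE encoder: 8x spatial downsampling to 4 channels.
    # A real encoder would be learned; here we just broadcast the mean.
    h, w, _ = image.shape
    return np.zeros((h // 8, w // 8, 4)) + image.mean()

def condition_diffusion_input(noisy_latent, flame_normals, input_frame):
    # Normal maps rendered from the FLAME fit are brought to latent
    # resolution (naive strided downsample here) and concatenated with
    # the noisy latent; VAE features carry identity/appearance cues.
    normals_lr = flame_normals[::8, ::8, :]
    id_feats = encode_vae(input_frame)
    return np.concatenate([noisy_latent, normals_lr, id_feats], axis=-1)

latent = np.zeros((32, 32, 4))            # noisy latent at 32x32
normals = np.ones((256, 256, 3))          # FLAME normal rendering
frame = np.full((256, 256, 3), 0.5)       # monocular input frame
x = condition_diffusion_input(latent, normals, frame)
print(x.shape)  # (32, 32, 11): 4 latent + 3 normal + 4 VAE channels
```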
📝 Abstract
We propose a novel approach for reconstructing animatable 3D Gaussian avatars from monocular videos captured by commodity devices such as smartphones. Photorealistic 3D head avatar reconstruction from such recordings is challenging: the limited observations leave unobserved regions under-constrained and can lead to artifacts in novel views. To address this problem, we introduce a multi-view head diffusion model and leverage its priors to fill in missing regions and ensure view consistency in Gaussian splatting renderings. To enable precise viewpoint control, we use normal maps rendered from a FLAME-based head reconstruction, which provide pixel-aligned inductive biases. We also condition the diffusion model on VAE features extracted from the input image to preserve details of facial identity and appearance. For Gaussian avatar reconstruction, we distill the multi-view diffusion priors by using iteratively denoised images as pseudo-ground truths, effectively mitigating over-saturation issues. To further improve photorealism, we apply latent upsampling to refine the denoised latent before decoding it into an image. On the NeRSemble dataset, GAF outperforms previous state-of-the-art methods in novel-view synthesis with a 5.34% higher SSIM score. Furthermore, we demonstrate higher-fidelity avatar reconstructions from monocular videos captured on commodity devices.
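The distillation step described above (iteratively denoised images as pseudo-ground truths, rather than score-distillation gradients) can be sketched with a toy optimization loop. Everything here is an assumption for illustration: the "renderer" is a linear map standing in for Gaussian splatting, and `denoise` is a mock diffusion model whose iterative steps pull a noisy rendering toward its conditioning signal (in the paper, FLAME normals plus VAE features).

```python
import numpy as np

rng = np.random.default_rng(0)

def render_view(params, view):
    # Stand-in for rendering the Gaussian avatar from one viewpoint.
    return view @ params

def denoise(noisy, cond, steps=10):
    # Mock multi-view diffusion model: each iterative denoising step
    # moves the noisy rendering toward the conditioning signal.
    x = noisy.copy()
    for _ in range(steps):
        x = x + 0.3 * (cond - x)
    return x

# Toy setup: 4 viewpoints, avatar parameters, and a target appearance.
views = [rng.standard_normal((8, 5)) for _ in range(4)]
true_params = rng.standard_normal((5, 3))
params = np.zeros((5, 3))

losses = []
for it in range(200):
    grad = np.zeros_like(params)
    loss = 0.0
    for V in views:
        rendering = render_view(params, V)
        noisy = rendering + 0.1 * rng.standard_normal(rendering.shape)
        # Iteratively denoised image serves as the pseudo-ground truth.
        pseudo_gt = denoise(noisy, cond=V @ true_params)
        diff = rendering - pseudo_gt
        # Plain photometric loss against the pseudo-GT, which avoids the
        # over-saturated gradients of direct score distillation.
        loss += float((diff ** 2).mean())
        grad += 2.0 * V.T @ diff / diff.size
    params -= 0.1 * grad
    losses.append(loss)
```

Running the loop drives the multi-view photometric loss toward zero, which is the mechanism the abstract describes: the diffusion prior supplies plausible targets for views the monocular video never observed.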