🤖 AI Summary
This work proposes a feedforward few-shot modeling approach for generating animatable 3D Gaussian avatars that can be driven in real time, addressing the limitations of existing methods that rely on multi-view data or per-identity optimization. Requiring only a handful of monocular images, the method fuses multi-scale features from DINOv3 and the Stable Diffusion VAE and employs a lightweight MLP-based dynamic network to predict expression-driven Gaussian deformations. This is the first approach to achieve generalizable 3D Gaussian avatar generation without test-time optimization. By integrating geometric supervision with priors from large pre-trained models, the framework significantly enhances geometric smoothness and rendering fidelity, outperforming current state-of-the-art methods in both inference efficiency and animation quality.
📝 Abstract
Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high-fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or on monocular videos with per-identity optimization at inference time, limiting their scalability and ease of use on unseen subjects. To overcome these efficiency drawbacks, we propose FastGHA, a feed-forward method that generates high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and the Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representation with per-Gaussian features and introduce a lightweight MLP-based dynamic network that predicts 3D Gaussian deformations from expression codes. Furthermore, to enhance the geometric smoothness of the 3D head, we employ point maps from a pre-trained large reconstruction model as geometry supervision. Experiments show that our approach significantly outperforms existing methods in both rendering quality and inference efficiency, while supporting real-time dynamic avatar animation.
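The dynamic component described above can be illustrated with a minimal sketch: a small MLP that takes a learned per-Gaussian feature together with a shared expression code and outputs a 3D position offset for each Gaussian. All dimensions, layer sizes, and the NumPy implementation here are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical dimensions -- the paper does not specify these.
N_GAUSSIANS, FEAT_DIM, EXPR_DIM, HIDDEN = 1024, 32, 64, 128
rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class DeformationMLP:
    """Sketch of a lightweight dynamic network:
    (per-Gaussian feature, expression code) -> 3D deformation."""
    def __init__(self):
        d_in = FEAT_DIM + EXPR_DIM
        self.w1 = rng.normal(0.0, 0.02, (d_in, HIDDEN))
        self.b1 = np.zeros(HIDDEN)
        self.w2 = rng.normal(0.0, 0.02, (HIDDEN, 3))
        self.b2 = np.zeros(3)

    def __call__(self, feats, expr):
        # Broadcast the shared per-frame expression code to every Gaussian,
        # concatenate with its learned feature, and regress an xyz offset.
        expr_tiled = np.broadcast_to(expr, (feats.shape[0], EXPR_DIM))
        x = np.concatenate([feats, expr_tiled], axis=-1)
        return relu(x @ self.w1 + self.b1) @ self.w2 + self.b2

feats = rng.normal(size=(N_GAUSSIANS, FEAT_DIM))  # learned per-Gaussian features
expr = rng.normal(size=(EXPR_DIM,))               # expression code for one frame
offsets = DeformationMLP()(feats, expr)           # (N_GAUSSIANS, 3) position deltas
```

Because the network is a two-layer MLP evaluated once per Gaussian, a single forward pass per frame is cheap enough for real-time driving; in practice such a network would be a GPU-batched module trained jointly with the Gaussian parameters.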