🤖 AI Summary
Single-image 4D facial avatar generation suffers from geometric distortion, inconsistent identity and expression, and heavy reliance on multi-view data. To address these bottlenecks, this paper proposes a unified framework that jointly leverages shape, image, and video priors. The method has three components: (1) 3D-GAN inversion combined with depth-guided, diffusion-based texture enhancement, which improves geometric fidelity and cross-view texture consistency; (2) a video prior driven by signals synchronized across viewpoints, which improves the temporal naturalness of expressions; and (3) a Consistent-Inconsistent training strategy that handles data inconsistencies during 4D reconstruction and separates identity from dynamic attributes. The approach reconstructs a full-view, high-fidelity 4D avatar from a single input image, and quantitative and qualitative evaluations show improvements over prior methods in geometry accuracy, cross-view consistency, and animation quality.
📝 Abstract
We present a novel framework for generating high-quality, animatable 4D avatars from a single image. While recent advances have shown promising results in 4D avatar creation, existing methods either require extensive multiview data or struggle with shape accuracy and identity consistency. To address these limitations, we propose a comprehensive system that leverages shape, image, and video priors to create full-view, animatable avatars. Our approach first obtains an initial coarse shape through 3D-GAN inversion. It then enhances the multiview textures with an image diffusion model, using depth-guided warping signals to keep them consistent across views. To handle expression animation, we incorporate a video prior with driving signals synchronized across viewpoints. We further introduce a Consistent-Inconsistent training strategy to handle data inconsistencies during 4D reconstruction. Experimental results demonstrate that our method achieves higher quality than prior art while maintaining consistency across viewpoints and expressions.
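To make the idea of a depth-guided warping signal concrete, the following is a minimal sketch of the standard geometric step such a signal usually relies on: reprojecting a pixel with known depth from one camera view into another. It is an illustration under stated assumptions, not the paper's implementation. The function names `reproject_pixel` and `yaw_rotation`, the shared pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the rigid transform (`R`, `t`) are all hypothetical choices made for this example; the abstract does not specify how the warping is actually computed.

```python
import math


def reproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Map a source-view pixel with known depth into a target view.

    Assumptions (not from the paper): both views share the same pinhole
    intrinsics, and R (3x3 nested list) with t (length-3 list) maps
    source-camera coordinates to target-camera coordinates.
    Returns (u', v', depth') in the target view, or None if the point
    ends up behind the target camera.
    """
    # Unproject: pixel + depth -> 3D point in the source camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    p = (x, y, depth)

    # Rigid transform into the target camera frame.
    q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    if q[2] <= 0:
        return None

    # Project back to pixel coordinates in the target view.
    return (fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy, q[2])


def yaw_rotation(angle_rad):
    """Rotation about the vertical (y) axis, as a 3x3 nested list."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


if __name__ == "__main__":
    # Hypothetical camera: 256x256 image, focal length 500 px.
    intrinsics = dict(fx=500.0, fy=500.0, cx=128.0, cy=128.0)
    # Shift the camera frame by 0.1 units along x, with no rotation.
    print(reproject_pixel(128.0, 128.0, 2.0,
                          R=yaw_rotation(0.0), t=[0.1, 0.0, 0.0],
                          **intrinsics))
```

Applying this mapping to every pixel of a rendered view yields a warped image in the target view. One plausible use, again an assumption here, is to give that warped image to the diffusion model as a conditioning signal so that the textures it generates agree across viewpoints.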