🤖 AI Summary
Zero-shot 4D head avatar generation from a single portrait image using video diffusion models suffers from spatiotemporal inconsistency and over-smoothing artifacts. Method: We propose a progressive spatiotemporal consistency learning paradigm that requires no training data or 3D priors. It employs a two-stage optimization: first fixing the expression to learn multi-view geometry, then fixing the viewpoint to learn dynamic expressions. Both stages are integrated with score distillation sampling (SDS) and iterative pseudo-data construction to mitigate spatiotemporal distortion. Contribution/Results: The key innovation lies in decoupling and sequentially modeling viewpoint and expression variations, which significantly improves reconstruction fidelity, animation naturalness, and rendering efficiency. Experiments demonstrate high-quality, highly controllable 4D head avatars under zero-shot conditions, establishing a lightweight paradigm for drivable virtual-human synthesis.
📝 Abstract
Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smoothed results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that synthesizes a spatially and temporally consistent dataset for 4D avatar reconstruction using the video diffusion model. Specifically, Zero-1-to-A iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front-to-side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that Zero-1-to-A improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. Code is publicly available at: https://github.com/ZhenglinZhou/Zero-1-to-A.
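The two-stage curriculum described above can be sketched as a toy simulation. This is not the paper's implementation: `pseudo_frame` is a hypothetical stand-in for sampling a pseudo ground-truth frame from the video diffusion model, and `fit` replaces SDS with plain gradient descent on a scalar per (view, expression) key. The sketch only illustrates the progressive structure: the pseudo-dataset is expanded one condition at a time (front-to-side views with a fixed expression, then relaxed-to-exaggerated expressions with a fixed view), and the avatar is re-optimized after each expansion so quality grows smoothly rather than from a single inconsistent batch.

```python
import math

def pseudo_frame(view_deg, expr_level):
    """Hypothetical stand-in for a video-diffusion pseudo ground-truth frame:
    a scalar signal depending on camera angle and expression intensity."""
    return math.cos(math.radians(view_deg)) + 0.1 * expr_level

def fit(dataset, steps=200, lr=0.05):
    """Toy stand-in for SDS-style optimization: fit one scalar parameter per
    (view, expression) key by gradient descent on squared error."""
    params = {k: 0.0 for k in dataset}
    for _ in range(steps):
        for k, target in dataset.items():
            grad = 2.0 * (params[k] - target)   # d/dp (p - target)^2
            params[k] -= lr * grad
    return params

def progressive_learning(views=(0, 30, 60, 90), exprs=(0, 1, 2, 3)):
    """Iteratively construct the pseudo-dataset and re-optimize the avatar.

    Stage 1 (spatial consistency): fix the first (relaxed) expression and
    sweep views from front to side. Stage 2 (temporal consistency): fix the
    front view and sweep expressions from relaxed to exaggerated."""
    dataset = {}
    for v in views:                              # Stage 1: front-to-side
        dataset[(v, exprs[0])] = pseudo_frame(v, exprs[0])
        avatar = fit(dataset)                    # re-optimize after each expansion
    for e in exprs:                              # Stage 2: relaxed-to-exaggerated
        dataset[(views[0], e)] = pseudo_frame(views[0], e)
        avatar = fit(dataset)
    return avatar, dataset
```

In the real method each `fit` call would distill a 4D Gaussian avatar against the current pseudo-video set; the simple-to-complex ordering is what keeps the pseudo ground truths mutually consistent.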