Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Zero-shot 4D head avatar generation from a single portrait image with video diffusion models suffers from spatiotemporal inconsistency and over-smoothing artifacts. Method: We propose a progressive spatiotemporal consistency learning paradigm that requires no training data or 3D priors. It employs a two-stage optimization: first fixing the expression to learn multi-view geometry, then fixing the viewpoint to learn dynamic expressions, integrated with score distillation sampling (SDS) and iterative pseudo-data construction to mitigate spatiotemporal distortion. Contribution/Results: The key innovation is to decouple viewpoint and expression variations and model them sequentially, significantly improving reconstruction fidelity, animation naturalness, and rendering efficiency. Experiments demonstrate high-quality, highly controllable 4D head avatars under zero-shot conditions, establishing a lightweight paradigm for drivable virtual human synthesis.
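
The sketch below is a hypothetical illustration of the two-stage schedule described in the summary: viewpoint and expression are decoupled and optimized one after the other while pseudo ground-truth clips are accumulated. The callables `render_fn`, `distill_fn`, and `generate_pseudo_fn`, as well as the camera and expression schedules, are placeholder names for illustration, not the authors' actual code or API.

```python
# Minimal sketch of progressive spatiotemporal consistency learning.
# render_fn, distill_fn, and generate_pseudo_fn are hypothetical callables
# standing in for the avatar renderer, the SDS-style distillation loss, and
# the video diffusion model's pseudo ground-truth generation, respectively.
import torch


def progressive_learning(avatar_params, render_fn, distill_fn, generate_pseudo_fn,
                         camera_schedule, expression_schedule,
                         steps_per_stage=200, lr=1e-3):
    optimizer = torch.optim.Adam(avatar_params, lr=lr)
    pseudo_dataset = []  # iteratively grown set of consistent pseudo videos

    def optimize(camera, expression):
        for _ in range(steps_per_stage):
            frames = render_fn(camera=camera, expression=expression)
            # Distillation loss plus reconstruction against the pseudo dataset.
            loss = distill_fn(frames, pseudo_dataset)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Append a newly generated clip as pseudo ground truth for later steps.
        pseudo_dataset.append(generate_pseudo_fn(camera=camera, expression=expression))

    # Stage 1: spatial consistency -- expression fixed, views swept front to side.
    for camera in camera_schedule:
        optimize(camera=camera, expression="neutral")  # "neutral" is an assumed fixed expression

    # Stage 2: temporal consistency -- view fixed, expressions swept
    # from relaxed to exaggerated.
    frontal = camera_schedule[0]
    for expression in expression_schedule:
        optimize(camera=frontal, expression=expression)

    return avatar_params, pseudo_dataset
```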

📝 Abstract
Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smooth results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that synthesizes a spatial and temporal consistency dataset for 4D avatar reconstruction using the video diffusion model. Specifically, Zero-1-to-A iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front-to-side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that Zero-1-to-A improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. Code is publicly available at: https://github.com/ZhenglinZhou/Zero-1-to-A.
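
For context, score distillation sampling aligns the rendered avatar with pseudo ground-truth predictions from a pre-trained diffusion model. The snippet below is a generic PyTorch sketch of the standard SDS loss (in the spirit of DreamFusion), not Zero-1-to-A's specific implementation; `denoiser` and `alphas_cumprod` are assumed to come from whatever video diffusion model is used.

```python
# Generic score distillation sampling (SDS) loss sketch in PyTorch.
# `denoiser(noisy, t, condition)` is assumed to predict the noise added at
# timestep t; `alphas_cumprod` is the diffusion schedule's cumulative alpha
# product of shape (num_timesteps,). Classifier-free guidance is omitted.
import torch
import torch.nn.functional as F


def sds_loss(denoiser, latents, condition, alphas_cumprod):
    b = latents.shape[0]
    alphas_cumprod = alphas_cumprod.to(latents.device)
    num_timesteps = alphas_cumprod.shape[0]

    # Sample one random diffusion timestep per example and add matching noise.
    t = torch.randint(0, num_timesteps, (b,), device=latents.device)
    noise = torch.randn_like(latents)
    a = alphas_cumprod[t].view(b, *([1] * (latents.dim() - 1)))
    noisy = a.sqrt() * latents + (1.0 - a).sqrt() * noise

    with torch.no_grad():
        eps_pred = denoiser(noisy, t, condition)

    w = 1.0 - a                       # common timestep weighting
    grad = w * (eps_pred - noise)     # SDS gradient w.r.t. the rendered latents
    # Reparameterize as an MSE so autograd pushes `grad` back into the
    # avatar/rendering parameters that produced `latents`.
    target = (latents - grad).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum") / b
```

In the paper's setting, `latents` would correspond to rendered avatar frames (or their latent encodings) conditioned on the input portrait, and this distillation term is combined with reconstruction against the iteratively constructed pseudo-video dataset.
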
Problem

Research questions and friction points this paper is trying to address.

Animatable head avatar generation typically demands extensive training data.
Distilling 4D avatars directly from video diffusion yields over-smoothed results due to spatial and temporal inconsistency.
Existing diffusion-based methods fall short in fidelity, animation quality, and rendering speed.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses video diffusion for 4D avatar synthesis
Iteratively constructs spatiotemporally consistent video datasets
Progressive learning enhances avatar quality and speed
Authors
Zhenglin Zhou (Zhejiang University)
Fan Ma (ReLER, CCAI, Zhejiang University)
Hehe Fan (ReLER, CCAI, Zhejiang University)
Tat-Seng Chua (National University of Singapore)