MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional single-image digital human generation methods suffer from limited view generalization, temporal incoherence, and low rendering efficiency. To address these challenges, we propose MVP4D, a framework that adapts pre-trained video diffusion models for multi-view-consistent 4D portrait video synthesis from a single reference image. Leveraging knowledge distillation and differentiable rendering, MVP4D constructs an explicit, real-time renderable dynamic 4D avatar. Our method jointly enforces multi-view consistency and dynamic appearance modeling, enabling high-fidelity, temporally stable, free-viewpoint synthesis across up to 360° of camera motion without requiring multi-view inputs. Extensive experiments demonstrate that MVP4D significantly outperforms state-of-the-art approaches in cross-view fidelity, temporal coherence, and perceptual quality, while substantially reducing the cost and technical barriers of high-quality digital human creation.

📝 Abstract
Digital human avatars aim to simulate the dynamic appearance of humans in virtual environments, enabling immersive experiences across gaming, film, virtual reality, and more. However, the conventional process for creating and animating photorealistic human avatars is expensive and time-consuming, requiring large camera capture rigs and significant manual effort from professional 3D artists. With the advent of capable image and video generation models, recent methods enable automatic rendering of realistic animated avatars from a single casually captured reference image of a target subject. While these techniques significantly lower barriers to avatar creation and offer compelling realism, they lack constraints provided by multi-view information or an explicit 3D representation. As a result, image quality and realism degrade when rendering from viewpoints that deviate strongly from the reference image. Here, we build a video model that generates animatable multi-view videos of digital humans based on a single reference image and target expressions. Our model, MVP4D, is based on a state-of-the-art pre-trained video diffusion model and generates hundreds of frames simultaneously from viewpoints varying by up to 360 degrees around a target subject. We show how to distill the outputs of this model into a 4D avatar that can be rendered in real-time. Our approach significantly improves the realism, temporal consistency, and 3D consistency of generated avatars compared to previous methods.
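The abstract describes a two-stage pipeline: a pre-trained video diffusion model generates multi-view animation frames from one reference image and target expressions, and those frames are then distilled into an explicit 4D avatar by optimizing a differentiable representation against them. Below is a toy, self-contained PyTorch sketch of the distillation idea only, under strong simplifying assumptions: the "generated videos" are random tensors, and the "avatar" is a learnable per-view, per-frame image grid rather than a real 3D representation; no names here correspond to MVP4D's actual code.

```python
# Toy sketch of distilling generated multi-view video frames into an
# explicit, differentiable representation. Illustrative only, not MVP4D's code.
import torch
import torch.nn.functional as F

V, T, H, W = 4, 16, 32, 32                  # views, frames, resolution (toy sizes)
target = torch.rand(V, T, 3, H, W)          # stand-in for diffusion-generated videos

# "Avatar": a learnable tensor standing in for a real differentiable 4D
# representation (the paper distills into one that renders in real time).
avatar = torch.zeros(V, T, 3, H, W, requires_grad=True)
opt = torch.optim.Adam([avatar], lr=0.1)

for step in range(300):
    v = torch.randint(0, V, (1,)).item()    # sample a random view and time step
    t = torch.randint(0, T, (1,)).item()
    rendered = torch.sigmoid(avatar[v, t])  # differentiable "render" of one frame
    loss = F.mse_loss(rendered, target[v, t])
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final photometric loss: {loss.item():.4f}")
```

In the actual method, the "render" step would be a differentiable renderer of a 3D representation viewed from a chosen camera, so the optimized avatar generalizes to novel viewpoints rather than memorizing per-view images as this toy does.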
Problem

Research questions and friction points this paper is trying to address.

Creating realistic animatable 4D avatars from single reference images
Addressing viewpoint degradation without multi-view constraints
Improving temporal and 3D consistency in avatar generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates multi-view videos from a single image (a toy camera-orbit helper is sketched after this list)
Distills video outputs into real-time 4D avatar
Uses pre-trained video diffusion model for generation
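As noted in the list above, the model is conditioned on viewpoints that can span up to 360 degrees around the subject. A minimal illustrative helper for generating such an orbit of camera positions follows; the function name and the horizontal-circle parameterization are assumptions for illustration, not the paper's convention.

```python
import math

def orbit_camera_positions(n_views: int, radius: float = 2.5, span_deg: float = 360.0):
    """(x, y, z) positions on a horizontal circle around a subject at the origin."""
    step = span_deg / n_views
    return [
        (radius * math.cos(math.radians(i * step)),
         0.0,
         radius * math.sin(math.radians(i * step)))
        for i in range(n_views)
    ]

# Four viewpoints 90 degrees apart, e.g. for conditioning a multi-view model.
print(orbit_camera_positions(4))
```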
👥 Authors
Felix Taubner
University of Toronto and Vector Institute, Canada
Ruihang Zhang
University of Toronto, Canada
Mathieu Tuli
LG Electronics, Canada
Sherwin Bahmani
University of Toronto, Canada
Computer Vision · Computer Graphics · Machine Learning
David B. Lindell
University of Toronto and Vector Institute, Canada