🤖 AI Summary
Existing methods struggle to synthesize synchronized 360° multi-view human videos from a monocular full-body capture. This paper proposes MV-Performer, a multi-view human video diffusion model tailored for human-centric 4D novel-view synthesis. Its core innovations are: (1) camera-dependent normal maps rendered from oriented partial point clouds, used as conditioning signals that disambiguate seen from unseen regions and enforce cross-view geometric consistency; and (2) a robust inference pipeline that mitigates artifacts induced by imperfect monocular depth estimation. The model jointly conditions the video diffusion process on the reference video, partial point-cloud renderings, and the target viewpoints. Extensive experiments on three benchmarks, including MVHumanNet, demonstrate that MV-Performer significantly outperforms prior art in view consistency, temporal synchronization, and generation robustness. To the authors' knowledge, it is the first method to achieve high-fidelity, omnidirectional, and temporally coherent novel-view human video synthesis.
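As a rough illustration of the normal-map conditioning signal (not the authors' code), the sketch below back-projects a monocular depth map into a partial point cloud, orients per-point normals toward the source camera via local PCA, and splats the result into a target view as a camera-space normal map. The intrinsics convention, the PCA-based normal estimation, and the nearest-point splatting are illustrative assumptions.

```python
# Minimal sketch, assuming pinhole intrinsics K and a 4x4 source-to-target pose.
# This is NOT MV-Performer's implementation, only an illustration of rendering
# camera-dependent normal maps from an oriented partial point cloud.
import numpy as np

def backproject(depth, K):
    """Lift an HxW depth map to 3D points in the source-camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def estimate_normals(points, k=16):
    """Per-point normals via local PCA, oriented toward the source camera."""
    normals = np.zeros_like(points)
    for i, p in enumerate(points):
        d = np.linalg.norm(points - p, axis=1)
        nbrs = points[np.argsort(d)[:k]]
        cov = np.cov((nbrs - nbrs.mean(0)).T)
        n = np.linalg.eigh(cov)[1][:, 0]      # eigenvector of smallest eigenvalue
        if np.dot(n, -p) < 0:                 # flip toward camera at the origin
            n = -n
        normals[i] = n
    return normals

def render_normal_map(points, normals, K_tgt, T_src_to_tgt, hw):
    """Splat the oriented point cloud into a target view as a normal map."""
    h, w = hw
    R, t = T_src_to_tgt[:3, :3], T_src_to_tgt[:3, 3]
    p = points @ R.T + t                      # points in the target-camera frame
    n = normals @ R.T                         # rotate normals accordingly
    front = p[:, 2] > 1e-6                    # keep points in front of the camera
    p, n = p[front], n[front]
    uvz = p @ K_tgt.T
    u = (uvz[:, 0] / uvz[:, 2]).astype(int)
    v = (uvz[:, 1] / uvz[:, 2]).astype(int)
    nmap = np.zeros((h, w, 3))
    zbuf = np.full((h, w), np.inf)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi, ni in zip(u[ok], v[ok], p[ok, 2], n[ok]):
        if zi < zbuf[vi, ui]:                 # nearest point wins per pixel
            zbuf[vi, ui] = zi
            nmap[vi, ui] = ni
    return nmap
```

Because the normals carry the source camera's orientation, a target view sees them rotated into its own frame, which is what makes the rendered map camera-dependent and lets the model tell observed surfaces apart from occluded ones.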
📝 Abstract
Recent breakthroughs in video generation, powered by large-scale datasets and diffusion techniques, have shown that video diffusion models can function as implicit 4D novel view synthesizers. Nevertheless, current methods primarily concentrate on redirecting the camera trajectory within the frontal view and struggle to generate 360-degree viewpoint changes. In this paper, we focus on the human-centric subdomain and present MV-Performer, an innovative framework for creating synchronized novel-view videos from monocular full-body captures. To achieve 360-degree synthesis, we extensively leverage the MVHumanNet dataset and incorporate an informative conditioning signal. Specifically, we use camera-dependent normal maps rendered from oriented partial point clouds, which effectively alleviate the ambiguity between seen and unseen observations. To maintain synchronization across the generated videos, we propose a multi-view human-centric video diffusion model that fuses information from the reference video, partial renderings, and different viewpoints. Additionally, we provide a robust inference procedure for in-the-wild videos, which greatly mitigates the artifacts induced by imperfect monocular depth estimation. Extensive experiments on three datasets demonstrate MV-Performer's state-of-the-art effectiveness and robustness, establishing a strong model for human-centric 4D novel view synthesis.
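For intuition only, the hypothetical sketch below shows one way the three conditioning streams named in the abstract (reference video, partial renderings, and per-view normal maps) could be packed for a multi-view video diffusion backbone. The tensor shapes, the channel-wise concatenation, and the assumption that the image-space conditions are already resized to latent resolution are all illustrative choices, not MV-Performer's actual architecture.

```python
# Hypothetical conditioning layout; shapes and fusion scheme are assumptions.
import torch

def pack_conditions(noisy_latents, ref_latents, partial_render, normal_maps):
    """
    noisy_latents:  (V, T, C, H, W)  latents for V target views, T frames
    ref_latents:    (1, T, C, H, W)  encoded reference (source-view) video
    partial_render: (V, T, 3, H, W)  point-cloud renderings per target view
    normal_maps:    (V, T, 3, H, W)  camera-dependent normal maps per view
    All conditions are assumed pre-aligned to the latent resolution (H, W).
    """
    V = noisy_latents.shape[0]
    ref = ref_latents.expand(V, -1, -1, -1, -1)   # share the reference across views
    return torch.cat([noisy_latents, ref, partial_render, normal_maps], dim=2)

if __name__ == "__main__":
    V, T, C, H, W = 4, 8, 4, 32, 32
    x = pack_conditions(torch.randn(V, T, C, H, W),
                        torch.randn(1, T, C, H, W),
                        torch.randn(V, T, 3, H, W),
                        torch.randn(V, T, 3, H, W))
    print(x.shape)  # torch.Size([4, 8, 14, 32, 32])
```

The point of such a layout is simply that every target view's denoising input carries the shared reference appearance plus its own geometry-derived cues, which is the kind of fusion the abstract describes at a high level.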