Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Model

📅 2025-03-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper introduces what the authors describe as the first training-free method for generating multi-view (4D) video from a single input video, targeting the problem of preserving spatio-temporal consistency without any additional training. The approach has two stages: (1) the edge frames of the spatio-temporal sampling grid are treated as key frames and synthesized with a video diffusion model, guided by depth-based warping; and (2) the remaining frames are interpolated with a video diffusion model so that the grid is fully populated and stays consistent across views and time. The method requires no fine-tuning, additional training, or domain-specific datasets, only an off-the-shelf video diffusion model. According to the summary, experiments show spatially aligned and temporally coherent 4D videos with lower compute and data requirements than training-based approaches, and quantitative gains over existing zero-shot baselines on multiple metrics.

📝 Abstract

Recently, multi-view or 4D video generation has emerged as a significant research topic. Nonetheless, recent approaches to 4D generation still struggle with fundamental limitations: they rely either on combining multiple video diffusion models with additional training, or on compute-intensive training of a full 4D diffusion model with limited real-world 4D data. To address these challenges, we propose the first training-free 4D video generation method that leverages off-the-shelf video diffusion models to generate multi-view videos from a single input video. Our approach consists of two key steps: (1) By designating the edge frames in the spatio-temporal sampling grid as key frames, we first synthesize them using a video diffusion model, leveraging a depth-based warping technique for guidance. This ensures structural consistency across the generated frames, preserving spatial and temporal coherence. (2) We then interpolate the remaining frames using a video diffusion model, constructing a fully populated sampling grid while preserving spatial and temporal consistency. Through this approach, we extend a single video into a multi-view video along novel camera trajectories while maintaining spatio-temporal consistency. Our method is training-free and fully utilizes an off-the-shelf video diffusion model, offering a practical and effective solution for multi-view video generation.
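
The two steps in the abstract can be pictured as bookkeeping over a grid indexed by (camera view, time). The sketch below is only an illustration of that bookkeeping, not the authors' implementation: the names `edge_mask` and `plan_grid` are hypothetical, and it assumes the input video occupies the first row of the grid (view 0 at every time step), which the abstract does not state explicitly. No diffusion model is called; the code only decides which cells would be key frames and which would be interpolated.

```python
import numpy as np

def edge_mask(num_views: int, num_frames: int) -> np.ndarray:
    """Boolean (views x frames) grid that is True on the outer border."""
    mask = np.zeros((num_views, num_frames), dtype=bool)
    mask[0, :] = True    # first view, all time steps
    mask[-1, :] = True   # last view, all time steps
    mask[:, 0] = True    # all views at the first time step
    mask[:, -1] = True   # all views at the last time step
    return mask

def plan_grid(num_views: int, num_frames: int) -> dict:
    """Split grid cells into input, key-frame, and interpolated cells.

    Assumption (not from the source): row 0 is the given input video.
    """
    border = edge_mask(num_views, num_frames)
    plan = {"input": [], "key": [], "interpolate": []}
    for v in range(num_views):
        for t in range(num_frames):
            if v == 0:
                plan["input"].append((v, t))        # already available
            elif border[v, t]:
                plan["key"].append((v, t))          # step 1: warping-guided synthesis
            else:
                plan["interpolate"].append((v, t))  # step 2: interpolation
    return plan

if __name__ == "__main__":
    plan = plan_grid(num_views=4, num_frames=5)
    for name, cells in plan.items():
        print(name, len(cells))
```

Every cell lands in exactly one group, so the grid is fully populated once both steps have run, which is the property the abstract emphasizes.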

Problem

Research questions and friction points this paper is trying to address.

Training-free 4D video generation from single video

Leveraging off-the-shelf video diffusion models

Ensuring spatio-temporal consistency in multi-view videos

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free 4D video generation method

Uses off-the-shelf video diffusion models

Depth-based warping for structural consistency
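
As a rough illustration of what depth-based warping means in general, the sketch below forward-warps one frame into a new camera view under a pinhole camera model. It is a minimal sketch under stated assumptions, not the paper's procedure: the function name `warp_by_depth` is hypothetical, and the per-pixel depth map, the intrinsics `K`, and the relative pose `(R, t)` are assumed to be given. How the paper estimates depth or treats disoccluded regions is not described in this summary. The returned mask marks pixels that received a value; the remaining holes are the regions a generative model would have to fill.

```python
import numpy as np

def warp_by_depth(image, depth, K, R, t):
    """Forward-warp an (H, W, C) image into a target camera.

    depth: (H, W) depth in the source camera; K: 3x3 intrinsics;
    R, t: rotation and translation from source to target camera frame.
    Returns the warped image and a boolean mask of filled pixels.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=0)  # (3, N)

    # Back-project pixels to 3D points, then move them to the target frame.
    pts_src = (np.linalg.inv(K) @ pix) * depth.ravel()
    pts_tgt = R @ pts_src + t.reshape(3, 1)
    z = pts_tgt[2]
    proj = K @ pts_tgt

    # Project to integer pixel coordinates; ignore points behind the camera.
    front = z > 1e-6
    u = np.full(h * w, -1, dtype=np.int64)
    v = np.full(h * w, -1, dtype=np.int64)
    u[front] = np.rint(proj[0, front] / z[front]).astype(np.int64)
    v[front] = np.rint(proj[1, front] / z[front]).astype(np.int64)
    idx = np.flatnonzero(front & (u >= 0) & (u < w) & (v >= 0) & (v < h))

    # Z-buffer: keep the nearest source point for every target pixel.
    zbuf = np.full((h, w), np.inf)
    np.minimum.at(zbuf, (v[idx], u[idx]), z[idx])
    win = idx[z[idx] == zbuf[v[idx], u[idx]]]

    warped = np.zeros_like(image)
    warped[v[win], u[win]] = image.reshape(h * w, -1)[win]
    return warped, np.isfinite(zbuf)
```

With an identity pose the function returns the input frame unchanged. With a sideways camera translation and constant depth, the content shifts by a whole number of pixels and a strip of unfilled pixels appears at one border, which shows why warped frames serve only as guidance rather than as final output.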

🔎 Similar Papers

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency