Reangle-A-Video: 4D Video Generation as Video-to-Video Translation

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses single-video-to-synchronized-multi-view video generation, proposing a video-to-video translation framework that requires no large-scale 4D training data. Methodologically, it decouples multi-view synthesis into two components: (1) self-supervised multi-view motion learning, in which an image-to-video diffusion transformer is fine-tuned to distill view-invariant motion (jointly capturing camera motion and scene dynamics) from warped versions of the input video; and (2) inference-time geometry-guided consistent image translation, which combines sparse 3D structure estimated by DUSt3R, optical-flow-based warping, and mask-based inpainting to produce multi-view consistent starting frames. The approach is built on a video diffusion transformer. Experiments on static view transport and dynamic camera control show state-of-the-art, high-fidelity, temporally coherent multi-view video generation without 4D supervision, with clear improvements over prior methods.
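The geometry-guided translation step warps the input's first frame into a target camera and leaves a hole mask for the inpainting model to fill. The sketch below illustrates that warp-then-mask idea with a plain pinhole forward warp in NumPy; it is a hypothetical simplification (`warp_to_view`, the toy z-buffer splat, and the dense `depth` input are assumptions here), whereas the paper itself derives geometry from DUSt3R point maps and optical flow.

```python
import numpy as np

def warp_to_view(image, depth, K, R, t):
    """Forward-warp `image` (H, W, 3) into a novel view given per-pixel
    depth, intrinsics K, and a relative pose (R, t). Returns the warped
    image and a boolean mask of holes (pixels no source pixel reached),
    which a diffusion model would then inpaint."""
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).T  # 3 x N
    # Back-project source pixels to 3D points in the source camera frame.
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Move the points into the target camera frame and project them.
    proj = K @ (R @ pts + t.reshape(3, 1))
    z = proj[2]
    u = np.round(proj[0] / np.maximum(z, 1e-6)).astype(int)
    v = np.round(proj[1] / np.maximum(z, 1e-6)).astype(int)
    inb = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    warped = np.zeros_like(image)
    zbuf = np.full((H, W), np.inf)
    src = image.reshape(-1, 3)
    for i in np.flatnonzero(inb):  # z-buffered splat: nearest surface wins
        if z[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = z[i]
            warped[v[i], u[i]] = src[i]
    mask = np.isinf(zbuf)  # True where the target view was never covered
    return warped, mask
```

For an identity pose (`R = I`, `t = 0`) the warp reproduces the input exactly with an empty hole mask, which is a convenient sanity check before feeding the masked result to an inpainting model.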

📝 Abstract
We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, our method reframes the multi-view video generation task as video-to-videos translation, leveraging publicly available image and video diffusion priors. In essence, Reangle-A-Video operates in two stages. (1) Multi-View Motion Learning: An image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion from a set of warped videos. (2) Multi-View Consistent Image-to-Images Translation: The first frame of the input video is warped and inpainted into various camera perspectives under an inference-time cross-view consistency guidance using DUSt3R, generating multi-view consistent starting images. Extensive experiments on static view transport and dynamic camera control show that Reangle-A-Video surpasses existing methods, establishing a new solution for multi-view video generation. We will publicly release our code and data. Project page: https://hyeonho99.github.io/reangle-a-video/
Problem

Research questions and friction points this paper is trying to address.

Generating synchronized multi-view videos from a single input video
Avoiding reliance on large-scale 4D training datasets by reframing the task as video-to-videos translation
Maintaining cross-view consistency when leveraging image and video diffusion priors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-to-video translation for multi-view generation
Self-supervised multi-view motion learning
Cross-view consistency guidance using DUSt3R
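At inference time, the cross-view consistency guidance can be thought of as nudging each intermediate sample toward agreement with a geometry-warped reference wherever the warp marked pixels as visible, while occluded regions are left to the generator. The snippet below is a minimal, hypothetical illustration of one such gradient-style update (`consistency_guided_step` and its `guidance_scale` are assumptions, not the paper's exact DUSt3R-based formulation).

```python
import numpy as np

def consistency_guided_step(x, warped_ref, visible_mask, guidance_scale=0.5):
    """One illustrative guidance update: pull the current sample `x`
    (H, W, 3) toward the warped reference view in regions the warp
    marked visible (`visible_mask`, H x W of 0/1), leaving holes
    untouched for the diffusion model to fill freely."""
    # Gradient of 0.5 * || m * (x - ref) ||^2 with respect to x.
    grad = visible_mask[..., None] * (x - warped_ref)
    return x - guidance_scale * grad
```

With a full mask and a scale of 1 the step snaps the sample onto the reference; with an empty mask it is a no-op, which matches the intent of constraining only the regions where geometry provides evidence.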