🤖 AI Summary
Current video generation models excel at monocular video synthesis but struggle to directly produce high-quality, spatiotemporally consistent 3D stereo and immersive spatial videos. To address this, we propose a general-purpose, pose-free, fine-tuning-free post-processing framework: it estimates depth from monocular videos, performs multi-view warping, and jointly optimizes occluded regions and spatiotemporal consistency in latent space via frame-matrix inpainting and a dual-update mechanism. We further integrate 4D Gaussian splatting optimization to enhance geometric fidelity. Our method is plug-and-play—compatible with leading monocular video generators including Sora, Lumiere, WALT, and Zeroscope—without architectural or training modifications. Extensive evaluations on multiple benchmarks demonstrate significant improvements over prior approaches, enabling high-fidelity stereo pair and immersive spatial video generation.
📝 Abstract
While video generation models excel at producing high-quality monocular videos, generating 3D stereoscopic and spatial videos for immersive applications remains an underexplored challenge. We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel *frame matrix* inpainting framework. This framework utilizes the original video generation model to synthesize missing content across different viewpoints and timestamps, ensuring spatial and temporal consistency without requiring additional model fine-tuning. Moreover, we develop a *dual-update* scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. The resulting multi-view videos are then adapted into stereoscopic pairs or optimized into 4D Gaussians for spatial video synthesis. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, such as Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method significantly improves over previous approaches. Project page: https://daipengwa.github.io/S-2VG_ProjectPage/
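The first step of the pipeline (warping the monocular video into a new viewpoint using estimated depth) can be illustrated with a minimal sketch. This is not the paper's implementation; it is a simplified depth-based forward warp for a single horizontally shifted (stereo) view, where the `baseline` and `focal` parameters are assumed illustrative values. Pixels that receive no source pixel are the disoccluded holes the frame-matrix inpainting stage would fill:

```python
import numpy as np

def warp_to_stereo_view(image, depth, baseline=0.06, focal=500.0):
    """Forward-warp one frame to a horizontally shifted viewpoint using depth.

    Returns the warped frame plus a validity mask; mask == False marks
    disoccluded holes left for the inpainting stage. Hypothetical sketch:
    baseline/focal are illustrative, not values from the paper.
    """
    h, w = depth.shape
    warped = np.zeros_like(image)
    zbuf = np.full((h, w), np.inf)       # z-buffer: nearer surfaces win
    mask = np.zeros((h, w), dtype=bool)

    # Stereo disparity from depth: closer pixels shift more.
    disparity = baseline * focal / np.maximum(depth, 1e-6)
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.round(xs - disparity).astype(int)  # target x in the new view
    valid = (xt >= 0) & (xt < w)

    for y, x_src, x_dst in zip(ys[valid], xs[valid], xt[valid]):
        d = depth[y, x_src]
        if d < zbuf[y, x_dst]:           # nearer surface occludes farther one
            zbuf[y, x_dst] = d
            warped[y, x_dst] = image[y, x_src]
            mask[y, x_dst] = True
    return warped, mask
```

Applying this per frame for each pre-defined viewpoint yields the partially filled multi-view videos; the holes (`mask == False`) are exactly the regions the frame-matrix inpainting framework synthesizes with the original video generation model.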