🤖 AI Summary
Existing methods for stereo video generation rely on high-quality monocular input, rendering them ineffective for low-quality videos. This paper proposes the first end-to-end framework for stereo video generation and joint dual-view restoration tailored to degraded monocular videos, unifying stereo synthesis and collaborative inpainting within a single video diffusion model. Key innovations include optical-flow-guided view warping, warped mask conditioning, and degradation-aware fine-tuning, enabling effective training solely on small-scale synthetic data while generalizing robustly to real-world low-quality videos. Under low-resolution input conditions, the method significantly outperforms state-of-the-art approaches on quantitative metrics including SSIM and Stereo Consistency Score. According to the authors, it is the first to achieve high-fidelity, stereo-consistent video generation directly from severely degraded monocular inputs.
📝 Abstract
Stereo video generation has been gaining increasing attention with recent advancements in video diffusion models. However, most existing methods focus on generating 3D stereoscopic videos from monocular 2D videos. These approaches typically assume that the input monocular video is of high quality, making the task primarily about inpainting occluded regions in the warped video while preserving the visible areas. In this paper, we introduce a new pipeline that not only generates stereo videos but also enhances both left-view and right-view videos consistently with a single model. Our approach achieves this by fine-tuning the model on degraded data for restoration, as well as conditioning the model on warped masks for consistent stereo generation. As a result, our method can be fine-tuned on a relatively small synthetic stereo video dataset and applied to low-quality real-world videos, performing both stereo video generation and restoration. Experiments demonstrate that our method outperforms existing approaches both qualitatively and quantitatively in stereo video generation from low-resolution inputs.
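To make the warping-and-masking idea described above concrete, the following is a minimal, hypothetical sketch of disparity-based view warping: each left-view pixel is shifted horizontally to synthesize the right view, and a binary mask records which target pixels received a source pixel. The holes (mask = False) are the regions a diffusion model would be conditioned to inpaint. All names here are illustrative; this is not the paper's actual implementation, which uses optical-flow-guided warping inside a video diffusion model.

```python
import numpy as np

def warp_view_with_mask(left, disparity):
    """Warp a left-view image into a synthetic right view using per-pixel
    horizontal disparity (forward warping). Returns the warped view and a
    boolean mask that is True where a source pixel landed; False entries
    are disocclusion holes left for the generative model to fill.
    Hypothetical illustration only."""
    h, w = left.shape[:2]
    right = np.zeros_like(left)
    mask = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            # A positive disparity moves content left in the right view.
            xt = x - int(round(disparity[y, x]))
            if 0 <= xt < w:
                right[y, xt] = left[y, x]
                mask[y, xt] = True
    return right, mask

# With a constant disparity of 3, the rightmost 3 columns of the
# synthesized view are never written and show up as holes in the mask.
left = np.arange(48, dtype=float).reshape(6, 8)
disp = np.full((6, 8), 3.0)
right, mask = warp_view_with_mask(left, disp)
```

In practice the disparity varies with scene depth, so holes concentrate along depth discontinuities rather than the image border; the warped mask conditioning described in the abstract tells the model exactly which pixels to trust and which to synthesize.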