M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion

2025-05-22
AI Summary
This work addresses end-to-end monocular-to-stereo video conversion. We first estimate depth from the input monocular video and warp the left view to obtain an initial right view, then refine that view with Stable Video Diffusion (SVD) conditioned on the left view, the warped right view, and the disocclusion masks. We adapt SVD to take these inputs and modify its attention layers to compute full attention for disoccluded pixels, so that inpainting can draw on neighboring frames. The model is trained end-to-end by minimizing image-space losses. In a user study, our approach achieves an average rank of 1.43, the best among the four compared methods, and runs 6x faster than the second-placed method, outperforming previous state-of-the-art approaches.

๐Ÿ“ Abstract
We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to use the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. To effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right-view video end-to-end by minimizing image-space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, obtaining an average rank of 1.43 among the 4 compared methods in a user study, while being 6x faster than the second-placed method.
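The depth-based reprojection step that produces the warped right view and its disocclusion mask can be illustrated with a small sketch. This is not the authors' implementation: it assumes rectified cameras, a per-pixel disparity derived from depth with the standard relation d = f * B / Z, and nearest-pixel forward warping. The names `depth_to_disparity`, `warp_left_to_right`, `focal_px`, and `baseline` are hypothetical and chosen only for illustration.

```python
import numpy as np


def depth_to_disparity(depth, focal_px, baseline):
    """Standard rectified-stereo relation d = f * B / Z (an assumption here)."""
    return focal_px * baseline / depth


def warp_left_to_right(left, disparity):
    """Forward-warp a left image into the right view (illustrative only).

    left:      (H, W, C) array.
    disparity: (H, W) array of non-negative horizontal shifts in pixels.
    Returns the warped right view and a boolean disocclusion mask that is
    True wherever no left-view pixel landed.
    """
    h, w = disparity.shape
    right = np.zeros_like(left)
    # Largest disparity written so far at each target pixel (larger = nearer).
    zbuf = np.full((h, w), -np.inf)
    for y in range(h):
        for x in range(w):
            d = disparity[y, x]
            # A point at column x in the left view appears at x - d on the right.
            xt = int(round(x - d))
            if 0 <= xt < w and d > zbuf[y, xt]:
                zbuf[y, xt] = d
                right[y, xt] = left[y, x]
    # Target pixels never written to are the disoccluded holes.
    mask = np.isinf(zbuf)
    return right, mask


if __name__ == "__main__":
    # One row: two background pixels (disparity 0), two foreground (disparity 2).
    left = np.array([[[10], [20], [30], [40]]], dtype=np.uint8)
    disparity = np.array([[0.0, 0.0, 2.0, 2.0]])
    right, mask = warp_left_to_right(left, disparity)
    print(right[0, :, 0])  # foreground shifts left and covers the background
    print(mask[0])         # the two right-most columns are holes
```

In this toy row the foreground pixels move two columns to the left and overwrite the background, leaving holes on the right. Those holes are the regions an inpainting model would have to fill.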
Problem

Research questions and friction points this paper is trying to address.

Convert monocular video to stereo video effectively
Inpaint and refine the right view warped by depth-based reprojection
Enhance video quality with modified attention layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends Stable Video Diffusion for stereo conversion
Modifies attention layers for disoccluded pixels
Trains end-to-end with image space losses
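One way to picture "full attention for disoccluded pixels" is as a boolean attention mask over all video tokens. The sketch below is an assumption-laden illustration, not the paper's layer: it supposes that ordinary tokens attend only within their own frame, while tokens flagged as disoccluded may attend to every token in every frame. The function name `build_attention_mask` and the (frames, tokens-per-frame) layout are hypothetical.

```python
import numpy as np


def build_attention_mask(disoccluded):
    """Build a (T*N, T*N) boolean mask; entry [q, k] is True if query q may
    attend to key k.

    disoccluded: (T, N) boolean array, True for tokens covering disoccluded
                 pixels (T frames, N spatial tokens per frame).

    Assumed baseline (not from the source): a token attends only to tokens
    in its own frame. Disoccluded query tokens get full attention instead.
    """
    t, n = disoccluded.shape
    frame_id = np.repeat(np.arange(t), n)                  # frame index per token
    same_frame = frame_id[:, None] == frame_id[None, :]    # within-frame attention
    full_rows = disoccluded.reshape(-1)[:, None]           # rows that see everything
    return same_frame | full_rows


if __name__ == "__main__":
    # Two frames, two tokens each; only token 1 of frame 0 is disoccluded.
    flags = np.array([[False, True], [False, False]])
    print(build_attention_mask(flags).astype(int))
```

A mask of this shape could be passed to a masked attention operation so that only the hole regions pay the cost of attending across frames. Whether the paper restricts the cost in this way is not stated in the text above.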