Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding

📅 2025-12-16
🤖 AI Summary
This work addresses controllable, end-to-end conversion of monocular video to stereo video while avoiding the artifacts introduced by explicit depth estimation and image warping. We propose a conditional latent diffusion framework centered on a guided VAE decoder, which keeps the stereo output sharp and epipolar-consistent. The method also lets the user control the stereo strength (i.e., the disparity range) through a single scalar parameter at inference time. Because it bypasses depth prediction and post-hoc warping, it synthesizes the stereo video directly. Evaluated on three datasets of real-world stereo videos, it outperforms both traditional warping-based approaches and recent warping-free baselines.

๐Ÿ“ Abstract
The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present Elastic3D, a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids artifacts due to explicit depth estimation and warping. The key to its high-quality output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video. Moreover, our method gives the user control over the strength of the stereo effect (more precisely, the disparity range) at inference time, via an intuitive, scalar tuning knob. Experiments on three different datasets of real-world stereo videos show that our method outperforms both traditional warping-based and recent warping-free baselines and sets a new standard for reliable, controllable stereo video conversion. Video samples are available on the project page: https://elastic3d.github.io.

Problem

Research questions and friction points this paper is trying to address.

Automates monocular-to-stereo video conversion
Avoids artifacts from depth estimation and warping
Provides user control over stereo effect strength

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent diffusion model avoids depth estimation artifacts
Guided VAE decoder ensures sharp epipolar-consistent output
Scalar tuning knob enables user-controlled disparity range