Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis

📅 2025-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular-to-stereoscopic video generation suffers from severe artifacts, such as ghosting and pixel misalignment, when reconstructing specular or transparent objects, owing to the limitations of conventional single-layer disparity estimation. To address this, we propose a single-stage stereoscopic video synthesis framework that bypasses intermediate steps such as depth estimation, geometric modeling, and image inpainting, and directly synthesizes the second-view frames. Our method leverages a pre-trained text-to-video diffusion model, adapted with viewpoint conditioning, implicitly harnessing the geometric and material priors embedded in the model without requiring explicit 3D representations. Evaluated on complex real-world scenes, our approach substantially suppresses artifacts, improves disparity consistency and visual fidelity, and is robust to highlights and semi-transparent objects. Demonstration videos are publicly available.

📝 Abstract
The rising popularity of immersive visual experiences has increased interest in stereoscopic 3D video generation. Despite significant advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data. We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator. Given an input video, our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect. Prior and concurrent approaches for this task typically operate in multiple phases, first estimating video disparity or depth, then warping the video accordingly to produce a second view, and finally inpainting the disoccluded regions. This approach inherently fails when the scene involves specular surfaces or transparent objects. In such cases, single-layer disparity estimation is insufficient, resulting in artifacts and incorrect pixel shifts during warping. Our work bypasses these restrictions by directly synthesizing the new viewpoint, avoiding any intermediate steps. This is achieved by leveraging a pre-trained video model's priors on geometry, object materials, optics, and semantics, without relying on external geometry models or manually disentangling geometry from the synthesis process. We demonstrate the advantages of our approach in complex, real-world scenarios featuring diverse object materials and compositions. See videos on https://video-eye2eye.github.io
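The multi-phase baseline the abstract describes (estimate disparity, warp the frame to the second viewpoint, then inpaint disoccluded holes) can be sketched as follows. This is an illustrative minimal forward warp, not the paper's or any baseline's actual implementation; it also shows why a single disparity value per pixel cannot represent specular or transparent surfaces, where two depths overlap at one pixel:

```python
import numpy as np

def warp_to_right_view(left, disparity):
    """Toy forward warp: shift each pixel left by its disparity to form the
    right view. Real pipelines use depth networks and learned inpainting for
    the disoccluded (hole) regions; a single disparity per pixel is the
    limitation the paper targets.
    """
    h, w = left.shape[:2]
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            xr = x - int(round(disparity[y, x]))  # horizontal shift only
            if 0 <= xr < w:
                right[y, xr] = left[y, x]
                filled[y, xr] = True
    holes = ~filled  # disoccluded pixels that a baseline must inpaint
    return right, holes
```

On a one-row image with unit disparity, every pixel shifts one column left, leaving a hole at the right edge; with reflective or transparent content, the reflected layer and the surface layer need different shifts at the same pixel, which this single-layer model cannot express.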
Problem

Research questions and friction points this paper is trying to address.

Transforming monocular videos into stereoscopic 3D videos
Bypassing multi-phase disparity estimation and warping
Handling complex scenes with specular or transparent objects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transforms a text-to-video generator into a video-to-stereo generator
Directly synthesizes new viewpoint without intermediate steps
Leverages pre-trained video model for geometry and semantics
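The direct-synthesis idea above can be contrasted with the warp-and-inpaint baseline as an interface: the second view is produced by iterative denoising conditioned on the input video, with no explicit depth, warping, or inpainting stage. The function and the denoising loop below are a hypothetical sketch of that interface, not the paper's architecture:

```python
import numpy as np

def synthesize_right_view_direct(left_frames, denoise_step, num_steps=50):
    """Hypothetical interface for direct view synthesis: start from noise and
    iteratively denoise, conditioning every step on the input (left-view)
    frames, so geometry and material priors stay inside the generator.
    """
    rng = np.random.default_rng(0)
    right = rng.standard_normal(left_frames.shape)  # start from pure noise
    for t in range(num_steps, 0, -1):
        right = denoise_step(right, left_frames, t)  # conditioned denoising
    return right
```

With a toy `denoise_step` that blends the current estimate toward the conditioning frames, the output converges to the conditioning signal; in the actual method this role is played by a pre-trained video diffusion model adapted for the shifted viewpoint.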