Stereo World Model: Camera-Guided Stereo Video Generation

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing monocular RGB or RGB-D methods struggle to simultaneously preserve binocular geometric consistency and camera-motion fidelity when generating stereoscopic video. This work proposes StereoWorld, an end-to-end camera-guided stereoscopic world model that operates solely on RGB inputs: by jointly learning appearance and binocular geometry, it grounds scene structure through disparity. The method introduces rotary positional encoding (RoPE) in a unified camera frame and a stereo-aware attention decomposition that factorizes full 4D attention into 3D intra-view attention plus horizontal row-wise attention, exploiting the epipolar prior for disparity alignment. This design retains pretrained video priors while substantially reducing computational overhead. Experiments show that StereoWorld outperforms current monocular-to-stereo approaches across multiple benchmarks, achieving more than 3× faster generation and a 5% gain in view consistency, and enables end-to-end VR rendering and embodied policy learning without explicit depth estimation.
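The compute saving from the attention decomposition can be sketched with back-of-envelope arithmetic. This is my own illustration, not the paper's code, and the latent grid sizes are assumed toy values; cost is counted as (query tokens) × (key tokens).

```python
# Token grid: V views x T frames x H rows x W columns (assumed toy sizes).
V, T, H, W = 2, 16, 30, 40

# Full 4D attention: every token attends to every other token across
# both views, all frames, and all spatial positions.
full_4d = (V * T * H * W) ** 2

# Decomposed variant:
#  - 3D intra-view attention: each view attends within itself over time
#    and space (the pretrained monocular video prior is reused here);
#  - horizontal row attention: for each (frame, row), tokens attend only
#    along that row across the two views, matching the epipolar prior of
#    rectified stereo, where correspondences lie on the same scanline.
intra_view = V * (T * H * W) ** 2
row_wise = T * H * (V * W) ** 2
decomposed = intra_view + row_wise

ratio = full_4d / decomposed
print(ratio)
```

With two views the intra-view term dominates, so the factorization roughly halves attention cost relative to joint 4D attention (the row term is negligible); the paper's overall >3× generation speedup additionally reflects skipping the depth-estimate-and-inpaint pipeline, not this term alone.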

📝 Abstract
We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGB-D approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3× faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.
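The "relative, view- and time-consistent conditioning" claimed for the unified camera-frame RoPE rests on a standard property of rotary encodings: the score between a rotated query and key depends only on their positional offset, so tokens from both views indexed in one shared frame see consistent relative geometry. Below is a minimal single-frequency sketch of that property — my own illustration under assumed names, not the paper's implementation.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D feature pair by `angle` radians (one RoPE frequency)."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

def rope_score(q, k, m, n, freq=0.1):
    """Attention logit between a query at position m and key at position n,
    each rotated by its own position before the dot product."""
    qm = rotate(q, m * freq)
    kn = rotate(k, n * freq)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k = (1.0, 0.5), (0.3, -0.8)
s1 = rope_score(q, k, m=2, n=5)    # offset n - m = 3
s2 = rope_score(q, k, m=10, n=13)  # same offset, shifted absolute positions
print(abs(s1 - s2) < 1e-9)
```

Because the score is invariant to shifting both positions, expressing left- and right-view token positions in a single unified camera frame (rather than per-view coordinates) makes cross-view attention depend only on relative displacement, which is what a disparity-aligned correspondence needs.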
Problem

Research questions and friction points this paper is trying to address.

stereo video generation
binocular geometry
camera-conditioned modeling
disparity consistency
RGB modality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stereo Video Generation
Camera-Conditioned World Model
Rotary Positional Encoding
Stereo-Aware Attention
Epipolar Geometry