🤖 AI Summary
This work addresses monocular-to-stereoscopic image generation without explicit depth estimation or geometric warping. We propose an end-to-end diffusion-based approach that operates in a canonical rectified space, where viewpoint-conditioned embeddings let the generator implicitly model disparity and fill disocclusions, enabling fully differentiable, end-to-end training. To evaluate perceptual fidelity rigorously, we introduce a leakage-free assessment protocol that emphasizes downstream-relevant metrics: iSQoE for perceptual comfort and MEt3R for geometric consistency. Experiments demonstrate consistent gains over warp-and-inpaint, latent-warping, and warped-conditioning baselines on both layered and non-Lambertian scenes. Our method achieves state-of-the-art disparity sharpness and geometric consistency, demonstrating that high-fidelity stereo synthesis is achievable without depth prediction or explicit warping operations.
📝 Abstract
We introduce StereoSpace, a diffusion-based framework for monocular-to-stereo synthesis that models geometry purely through viewpoint conditioning, without explicit depth or warping. A canonical rectified space and viewpoint conditioning guide the generator to infer correspondences and fill disocclusions end-to-end. To ensure fair and leakage-free evaluation, we introduce an end-to-end protocol that excludes any ground-truth or proxy geometry estimates at test time. The protocol emphasizes metrics reflecting downstream relevance: iSQoE for perceptual comfort and MEt3R for geometric consistency. StereoSpace surpasses methods from the warp-and-inpaint, latent-warping, and warped-conditioning categories, achieving sharp parallax and strong robustness on layered and non-Lambertian scenes. This establishes viewpoint-conditioned diffusion as a scalable, depth-free solution for stereo generation.
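
For intuition, the sketch below shows one plausible way viewpoint-conditioned sampling of this kind could look at inference time: the left image is encoded once, a learned view embedding requests the right viewpoint, and a latent diffusion loop denoises the right view with no depth map or warp in the pipeline. This is a minimal illustration under assumed conventions, not the authors' implementation; `unet`, `vae`, `scheduler`, and `view_embedding` are hypothetical placeholders following common latent-diffusion APIs.

```python
# Hypothetical sketch of depth-free, viewpoint-conditioned right-view synthesis.
# All module names and signatures are assumptions, not the StereoSpace release.
import torch

@torch.no_grad()
def generate_right_view(left_img, unet, vae, scheduler, num_steps=50, device="cuda"):
    """left_img: (1, 3, H, W) tensor in [-1, 1], already rectified to the canonical space."""
    # Encode the left view once; its latent conditions every denoising step.
    left_latent = vae.encode(left_img.to(device)).latent_dist.mode()

    # A learned embedding specifies which viewpoint to synthesize
    # (e.g. 0 = reference/left, 1 = right); no depth or warp is involved.
    view_emb = unet.view_embedding(torch.tensor([1], device=device))

    # Start from Gaussian noise in latent space and denoise step by step.
    latent = torch.randn_like(left_latent)
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        # Condition by concatenating the clean left latent with the noisy right latent.
        model_in = torch.cat([latent, left_latent], dim=1)
        noise_pred = unet(model_in, t, view_emb)
        latent = scheduler.step(noise_pred, t, latent).prev_sample

    # Decode the denoised latent into the synthesized right image.
    right_img = vae.decode(latent).sample
    return right_img.clamp(-1, 1)
```

Under this reading, the leakage-free protocol would then score the predicted pair directly (e.g. iSQoE on the left/right images, MEt3R on their multi-view consistency) without feeding any ground-truth or estimated depth into the generator at test time.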