DMS: Diffusion-Based Multi-Baseline Stereo Generation for Improving Self-Supervised Depth Estimation

📅 2025-08-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Self-supervised depth estimation that uses stereo image pairs as supervision often suffers from ambiguous photometric reconstruction, because occluded and out-of-frame regions of the target view have no corresponding pixels in the source view. To address this, we propose DMS, a diffusion-based multi-baseline stereo generation approach: a Stable Diffusion model is fine-tuned to synthesize novel views along the epipolar direction, guided by directional prompts. It generates a left-left view, a right-right view, and an intermediate view between the two cameras, which supply the missing pixels and enable explicit photometric reconstruction without ground-truth annotations. The method is plug-and-play and model-agnostic, relying only on unlabeled stereo image pairs for both training and view synthesis. Extensive experiments show up to 35% outlier reduction and state-of-the-art performance across multiple benchmark datasets.

📝 Abstract

While supervised stereo matching and monocular depth estimation have advanced significantly with learning-based algorithms, self-supervised methods that use stereo images as supervision signals have received relatively little attention and require further investigation. A primary challenge arises from ambiguity introduced during photometric reconstruction, particularly due to missing corresponding pixels in ill-posed regions of the target view, such as occlusions and out-of-frame areas. To address this and establish explicit photometric correspondences, we propose DMS, a model-agnostic approach that utilizes geometric priors from diffusion models to synthesize novel views along the epipolar direction, guided by directional prompts. Specifically, we fine-tune a Stable Diffusion model to simulate perspectives at key positions: a left-left view shifted from the left camera, a right-right view shifted from the right camera, and an additional novel view between the left and right cameras. These synthesized views supplement occluded pixels, enabling explicit photometric reconstruction. Our proposed DMS is a cost-free, "plug-and-play" method that seamlessly enhances self-supervised stereo matching and monocular depth estimation, and relies solely on unlabeled stereo image pairs for both training and view synthesis. Extensive experiments demonstrate the effectiveness of our approach, with up to 35% outlier reduction and state-of-the-art performance across multiple benchmark datasets.
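To make the out-of-frame problem concrete, the following is a minimal NumPy sketch of photometric reconstruction for a rectified stereo pair. It is not the authors' implementation: the function names `warp_right_to_left` and `photometric_l1`, the nearest-neighbor sampling, and the plain L1 error are all assumptions chosen for brevity. It shows that left-view pixels whose match would fall outside the right image have no valid correspondence, which is the kind of gap the synthesized views are meant to fill.

```python
import numpy as np


def warp_right_to_left(right, disparity):
    """Reconstruct the left view by sampling the right image at x - d.

    Assumes a rectified grayscale pair, so correspondences lie on the same
    row. Uses nearest-neighbor sampling (an illustrative simplification).
    Returns the reconstruction and a mask of pixels with an in-frame match.
    """
    h, w = disparity.shape
    # Source column in the right image for every left-view pixel.
    xs = np.arange(w)[None, :] - np.rint(disparity).astype(int)
    valid = (xs >= 0) & (xs < w)          # False where the match is out of frame
    xs_clipped = np.clip(xs, 0, w - 1)    # keep indices legal; masked out later
    rows = np.arange(h)[:, None]
    recon = right[rows, xs_clipped]
    return recon, valid


def photometric_l1(left, recon, valid):
    """Mean absolute photometric error over pixels that have a valid match."""
    return np.abs(left - recon)[valid].mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    right = rng.random((4, 8))
    disparity = np.full((4, 8), 2.0)

    # Build a left image consistent with the disparity; the two leftmost
    # columns have no counterpart in the right image.
    left = np.zeros_like(right)
    left[:, 2:] = right[:, :-2]

    recon, valid = warp_right_to_left(right, disparity)
    print("pixels without a match per row:", int((~valid[0]).sum()))
    print("L1 error on valid pixels:", photometric_l1(left, recon, valid))
```

In this toy setup the invalid columns simply drop out of the loss, so they receive no supervision. An additional view that does cover those pixels would allow them to be reconstructed explicitly instead of masked.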

Problem

Research questions and friction points this paper is trying to address.

Addresses ambiguity in photometric reconstruction for self-supervised depth estimation

Synthesizes novel views to supplement occluded pixels in stereo images

Enhances self-supervised stereo matching and monocular depth estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses diffusion models to synthesize novel views

Generates supplementary views to handle occlusions

Enhances depth estimation with geometric priors

🔎 Similar Papers

Self-supervised Monocular Depth Estimation Based on Hierarchical Feature-Guided Diffusion