BRIDGE - Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular depth estimation (MDE) suffers from limited availability and poor quality of ground-truth depth annotations, severely constraining model robustness and cross-domain generalization. Method: We propose a reinforcement learning (RL)-driven depth-to-image (D2I) generation framework featuring a novel RL-optimized GAN generator that integrates autoregressive priors and geometric consistency constraints to synthesize over 20 million geometrically accurate, high-fidelity depth-image pairs. We further introduce a hybrid supervision paradigm combining teacher-generated pseudo-labels with sparse real depth annotations, enhanced by knowledge distillation and multi-stage loss optimization. Contribution/Results: Our approach achieves significant improvements over state-of-the-art methods across multiple benchmarks, particularly excelling in modeling complex structures and recovering fine-grained details. It substantially enhances cross-domain generalization and robustness under domain shift and challenging scene geometries.
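The hybrid supervision paradigm described above can be sketched as a per-pixel loss that applies sparse real depth annotations where they exist and teacher pseudo-labels on all remaining pixels. The function below is a minimal illustration, not the paper's implementation; the L1 penalty, the `alpha` weight, and all names are assumptions.

```python
import numpy as np

def hybrid_supervision_loss(pred, teacher_pred, gt, gt_mask, alpha=0.5):
    """Hypothetical hybrid loss: sparse ground-truth supervision on pixels
    where gt_mask is True, teacher pseudo-label distillation elsewhere.
    alpha weights the pseudo-label term against the ground-truth term."""
    # L1 loss on the sparse set of pixels with valid ground-truth depth
    gt_loss = np.abs(pred - gt)[gt_mask].mean() if gt_mask.any() else 0.0
    # Distillation loss on the remaining pixels, using teacher predictions
    pl_mask = ~gt_mask
    pl_loss = np.abs(pred - teacher_pred)[pl_mask].mean() if pl_mask.any() else 0.0
    return gt_loss + alpha * pl_loss
```

In practice such a scheme lets dense pseudo-labels fill in supervision where real sensors provide only sparse or noisy depth, while the real annotations anchor the scale.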

📝 Abstract
Monocular Depth Estimation (MDE) is a foundational task in computer vision. Traditional methods are limited by the scarcity and quality of training data, which hinders their robustness. To overcome this, we propose BRIDGE, an RL-optimized depth-to-image (D2I) generation framework that synthesizes over 20M realistic, geometrically accurate RGB images, each intrinsically paired with its ground-truth depth, from diverse source depth maps. We then train our depth estimation model on this dataset, employing a hybrid supervision strategy that combines teacher pseudo-labels with ground-truth depth for comprehensive and robust training. This data generation and training paradigm gives BRIDGE unprecedented scale and domain diversity; it consistently outperforms existing state-of-the-art approaches both quantitatively and in capturing complex scene detail, fostering general and robust depth features. Code and models are available at https://dingning-liu.github.io/bridge.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Overcoming data scarcity in monocular depth estimation tasks
Generating realistic RGB-depth pairs with geometric accuracy
Improving depth feature robustness across diverse domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL-optimized depth-to-image generation framework
Hybrid supervision with teacher pseudo-labels
Synthesizes 20M realistic RGB-depth image pairs
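One way to read the "geometric accuracy" requirement above: an image generated from a depth map should yield that same depth back when passed through a depth estimator, and the discrepancy can serve as a consistency reward for the RL-optimized generator. The sketch below is a hedged illustration under that assumption; the absolute-relative-error metric and all names are hypothetical, not the paper's definition.

```python
import numpy as np

def geometric_consistency_reward(source_depth, reestimated_depth, eps=1e-6):
    """Hypothetical reward: negative mean absolute relative error between
    the depth map the image was generated from and the depth re-estimated
    from the generated image. Perfect consistency gives a reward of 0."""
    rel_err = np.abs(reestimated_depth - source_depth) / (source_depth + eps)
    return -rel_err.mean()
```

A reward of this shape could be combined with a realism term (e.g. a discriminator score) when updating the generator.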
Authors

Dingning Liu, Fudan University (Multimodal, Reinforcement Learning, 3D Generation, Robotics, AI4Science)
Haoyu Guo, Shanghai AI Lab (Computer Vision, 3D Vision)
Jingyi Zhou, Shanghai Artificial Intelligence Laboratory
Tong He, Shanghai Artificial Intelligence Laboratory