Bridging Spectral-wise and Multi-spectral Depth Estimation via Geometry-guided Contrastive Learning

๐Ÿ“… 2025-03-02
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the limited robustness, high memory overhead, and poor deployment flexibility of multi-spectral depth estimation in autonomous driving, this paper proposes an โ€œalignment-fusionโ€ two-stage framework. First, a geometry-guided cross-spectral contrastive learning mechanism is designed to achieve spatial and semantic alignment of RGB, NIR, and thermal features. Second, a lightweight, plug-and-play fusion module is introduced, supporting both spectral-invariant depth estimation and adaptive multi-spectral fusion. Notably, this work is the first to incorporate explicit geometric constraints into cross-modal contrastive learning, jointly modeling global consistency and local geometric structure. The method achieves state-of-the-art performance on multiple multi-spectral depth benchmarks, reducing estimation error by 22% under adverse weather and low-light conditions, accelerating inference by 40%, and decreasing parameter count by 35%.

Technology Category

Application Category

๐Ÿ“ Abstract
Deploying depth estimation networks in the real world requires high-level robustness against various adverse conditions to ensure safe and reliable autonomy. For this purpose, many autonomous vehicles employ multi-modal sensor systems, including an RGB camera, NIR camera, thermal camera, LiDAR, or Radar. They mainly adopt two strategies to use multiple sensors: modality-wise and multi-modal fused inference. The former method is flexible but memory-inefficient, unreliable, and vulnerable. Multi-modal fusion can provide high-level reliability, yet it needs a specialized architecture. In this paper, we propose an effective solution, named align-and-fuse strategy, for the depth estimation from multi-spectral images. In the align stage, we align embedding spaces between multiple spectrum bands to learn shareable representation across multi-spectral images by minimizing contrastive loss of global and spatially aligned local features with geometry cue. After that, in the fuse stage, we train an attachable feature fusion module that can selectively aggregate the multi-spectral features for reliable and robust prediction results. Based on the proposed method, a single-depth network can achieve both spectral-invariant and multi-spectral fused depth estimation while preserving reliability, memory efficiency, and flexibility.
Problem

Research questions and friction points this paper is trying to address.

Improves depth estimation robustness across multi-spectral images.
Addresses memory inefficiency and unreliability in multi-modal sensor systems.
Enables spectral-invariant and fused depth estimation with a single network.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Align embedding spaces across multi-spectral bands
Minimize contrastive loss with geometry-guided features
Attachable feature fusion module for selective aggregation
๐Ÿ”Ž Similar Papers
No similar papers found.