$D^3$-RSMDE: 40$\times$ Faster and High-Fidelity Remote Sensing Monocular Depth Estimation

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of balancing accuracy and efficiency in monocular depth estimation for remote sensing imagery by proposing a structure prior–guided diffusion refinement mechanism. The approach first uses a Vision Transformer to rapidly generate a global structural prior, then performs a small number of lightweight iterative refinements with a compact U-Net inside the latent space of a variational autoencoder (VAE), applying a progressive linear fusion strategy to sharpen fine details. The method markedly improves perceptual quality and inference speed while keeping memory consumption comparable to that of lightweight ViT models: it reduces LPIPS perceptual error by 11.85% and accelerates inference by over 40× relative to state-of-the-art diffusion models such as Marigold.
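
To make the two-stage pipeline concrete, here is a minimal PyTorch-style sketch of the inference flow described above. Every name in it (`d3_rsmde_infer`, `vit_prior`, `vae`, `unet`, `alphas`) is a hypothetical stand-in rather than the authors' released code, and the blending schedule is one assumed reading of the progressive linear fusion strategy.

```python
import torch

# Minimal sketch of the D^3-RSMDE two-stage inference described above.
# All names (vit_prior, vae, unet, alphas) are hypothetical stand-ins,
# not the authors' code; the schedule is an assumed reading of the
# "progressive linear fusion" strategy.

@torch.no_grad()
def d3_rsmde_infer(image, vit_prior, vae, unet, alphas=(0.25, 0.5, 1.0)):
    """image: (B, 3, H, W) remote sensing tensor; returns a refined depth map."""
    # Stage 1: one fast ViT forward pass yields a preliminary depth map,
    # replacing the slow structure-generation phase of a diffusion model.
    depth_prior = vit_prior(image)           # (B, 1, H, W) structural prior

    # Refinement operates in the compact VAE latent space.
    z = vae.encode(depth_prior)              # (B, C, h, w) latent

    # Stage 2: a few lightweight U-Net iterations; each output is linearly
    # blended into the running latent (Progressive Linear Blending sketch).
    for alpha in alphas:
        z_refined = unet(z)                  # detail-oriented latent update
        z = (1.0 - alpha) * z + alpha * z_refined

    return vae.decode(z)                     # decode back to a depth map
```

Under this reading, inference cost is one ViT pass plus a handful of cheap U-Net passes on a downsampled latent, which is consistent with the reported 40× speedup over full diffusion sampling in models like Marigold.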

📝 Abstract
Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Vision Transformer (ViT) backbones for dense prediction are fast, but their outputs often exhibit poor perceptual quality; conversely, diffusion models offer high fidelity at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation ($D^3$-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map, which serves as a structural prior and effectively replaces the time-consuming initial structure-generation stage of diffusion models. Building on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that $D^3$-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) metric over leading models such as Marigold, while also achieving over a 40× speedup in inference and maintaining VRAM usage comparable to lightweight ViT models.
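
The abstract does not spell out the PLBR update rule, but one plausible formalization of "progressive linear blending" in the VAE latent space, offered here purely as a hedged reading, is:

$$z_0 = \mathcal{E}\big(\hat{d}_{\mathrm{ViT}}\big), \qquad z_{k+1} = (1 - \alpha_k)\, z_k + \alpha_k\, f_\theta(z_k), \quad k = 0, \dots, K-1, \qquad \hat{d} = \mathcal{D}(z_K)$$

where $\mathcal{E}$ and $\mathcal{D}$ are the VAE encoder and decoder, $\hat{d}_{\mathrm{ViT}}$ is the ViT structural prior, $f_\theta$ is the lightweight U-Net, $K$ is small (a few iterations), and $\alpha_k$ is an assumed linearly varying blend weight; all symbols beyond those named in the abstract are assumptions for illustration.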
Problem

Research questions and friction points this paper is trying to address.

monocular depth estimation
remote sensing
accuracy-efficiency trade-off
diffusion models
Vision Transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monocular Depth Estimation
Vision Transformer
Diffusion Model
Progressive Linear Blending Refinement
Remote Sensing

👥 Authors

Ruizhi Wang
School of Software Technology, Zhejiang University

Weihan Li
Eastern Institute of Technology, Ningbo
Energy storage materials · Synchrotron characterizations

Zunlei Feng
School of Software Technology, Zhejiang University; State Key Laboratory of Blockchain and Data Security, Zhejiang University

Haofei Zhang
State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security

Mingli Song
School of Software Technology, Zhejiang University; State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security

Jiayu Wang
Beihang University & Jiangnan University & The University of Auckland
Soft sensor · Data-driven · Fault detection · Process monitoring

Jie Song
Professor, University of Massachusetts Chan Medical School
Biomaterials · Regenerative medicine

Li Sun
Ningbo Innovation Center, Zhejiang University
Computer vision · Robotics · Machine learning · Artificial intelligence