AsyncMDE: Real-Time Monocular Depth Estimation via Asynchronous Spatial Memory

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of foundation models for monocular depth estimation, which hinders their real-time deployment on edge devices, and the underutilization of inter-frame redundancy in existing approaches. To overcome these limitations, we propose AsyncMDE, the first method to introduce an asynchronous spatial memory mechanism. It decouples computation: a large background model generates high-quality spatial features while a lightweight foreground model performs real-time inference. Cross-frame feature reuse is enabled through an autoregressive memory update and a complementary fusion strategy. With only 3.83 million parameters, AsyncMDE achieves 237 FPS on an RTX 4090—closing 77% of the accuracy gap to the foundation model—and 161 FPS on a Jetson AGX Orin, significantly outperforming current state-of-the-art methods.

📝 Abstract
Foundation-model-based monocular depth estimation offers a viable alternative to active sensors for robot perception, yet its computational cost often prohibits deployment on edge platforms. Existing methods perform independent per-frame inference, leaving unexploited the substantial computational redundancy between adjacent viewpoints in continuous robot operation. This paper presents AsyncMDE, an asynchronous depth perception system consisting of a foundation model and a lightweight model that amortizes the foundation model's computational cost over time. The foundation model produces high-quality spatial features in the background, while the lightweight model runs asynchronously in the foreground, fusing cached memory with current observations through complementary fusion, outputting depth estimates, and autoregressively updating the memory. This enables cross-frame feature reuse with bounded accuracy degradation. With only 3.83M parameters, it operates at 237 FPS on an RTX 4090, recovering 77% of the accuracy gap to the foundation model while achieving a 25× parameter reduction. Validated across indoor static, dynamic, and synthetic extreme-motion benchmarks, AsyncMDE degrades gracefully between refreshes and achieves 161 FPS on a Jetson AGX Orin with TensorRT, clearly demonstrating its feasibility for real-time edge deployment.
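The asynchronous schedule the abstract describes — an expensive background pass that periodically refreshes a spatial memory, while a lightweight foreground model fuses that memory with the current frame and updates it autoregressively — can be sketched as a deterministic simulation. All function names and the toy feature arithmetic below are illustrative stand-ins, not the paper's actual models or fusion operators:

```python
def heavy_features(frame):
    """Stand-in for the foundation model's expensive spatial features.
    In the real system this would be a large network pass running
    asynchronously in the background."""
    return [2.0 * x for x in frame]

def light_fuse(memory, frame, alpha=0.7):
    """Stand-in for the lightweight model's complementary fusion of
    cached memory with the current observation."""
    return [alpha * m + (1.0 - alpha) * x for m, x in zip(memory, frame)]

def run_async_schedule(frames, refresh_every=4):
    """Amortize the heavy model over time: refresh memory only every
    `refresh_every` frames; run the light model on every frame and
    autoregressively write its output back into the memory."""
    memory = heavy_features(frames[0])        # initial background pass
    depths = []
    for t, frame in enumerate(frames):
        if t > 0 and t % refresh_every == 0:  # background refresh lands
            memory = heavy_features(frame)
        depth = light_fuse(memory, frame)     # real-time foreground inference
        memory = depth                        # autoregressive memory update
        depths.append(depth)
    return depths
```

Between refreshes the memory drifts toward the cheap model's estimates, which mirrors the "graceful degradation between refreshes" the abstract reports; a background refresh periodically snaps the memory back to high-quality features.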
Problem

Research questions and friction points this paper is trying to address.

monocular depth estimation
real-time perception
edge deployment
computational redundancy
foundation model
Innovation

Methods, ideas, or system contributions that make the work stand out.

asynchronous inference
monocular depth estimation
spatial memory
foundation model
edge deployment
Lianjie Ma
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China
Yuquan Li
Associate Professor at Guizhou University
AI for Science · Molecule Design
Bingzheng Jiang
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China
Ziming Zhong
ShanghaiTech University
Han Ding
School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
Lijun Zhu
Purdue University