🤖 AI Summary
To address poor temporal consistency and high GPU memory consumption in online monocular video depth estimation, this paper proposes a lightweight and efficient online inference framework. Methodologically, we introduce a latent feature caching mechanism to reuse historical frame features, employ a frame-level random masking training strategy to mitigate temporal overfitting, and jointly optimize a lightweight network architecture with a temporal consistency loss. To our knowledge, this is the first approach to achieve truly online (unidirectional streaming) depth estimation without compromising accuracy, while significantly reducing the GPU memory footprint. Experiments demonstrate real-time performance: 42 FPS on an NVIDIA A100 and 20 FPS on a Jetson edge device. Our method achieves state-of-the-art accuracy among online methods on the KITTI and NYUv2 benchmarks, reduces GPU memory consumption by 37%–52%, and offers strong real-time capability, robustness, and practical deployability.
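The frame-level random masking strategy mentioned above can be illustrated with a minimal NumPy sketch. The paper does not specify its masking details here, so `mask_frames`, the mask probability, and the zero-fill choice are all hypothetical; the point is only that entire frames of a training clip are dropped at random, forcing the model to lean on temporal context instead of overfitting to each frame.

```python
import numpy as np

def mask_frames(clip, mask_prob=0.25, rng=None):
    """Randomly zero out whole frames of a (T, H, W, C) training clip.

    Hypothetical sketch of frame-level random masking: each frame is
    masked independently with probability `mask_prob`.
    """
    rng = rng or np.random.default_rng(0)
    clip = clip.copy()
    mask = rng.random(clip.shape[0]) < mask_prob  # True = frame is masked
    clip[mask] = 0.0                              # masked frames become all-zero
    return clip, mask

# Toy clip: 8 frames of ones, so masked frames are easy to spot.
clip = np.ones((8, 4, 4, 3))
masked, mask = mask_frames(clip, mask_prob=0.5)
print(masked.shape, int(mask.sum()))
```

Zero-filling is the simplest choice; a learned mask token, as in masked-image-modeling setups, would be a natural alternative.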
📝 Abstract
Depth estimation from monocular video has become a key component of many real-world computer vision systems. Recently, Video Depth Anything (VDA) has demonstrated strong performance on long video sequences. However, it relies on batch processing, which prohibits its use in an online setting. In this work, we overcome this limitation and introduce online VDA (oVDA). The key innovation is to employ techniques from Large Language Models (LLMs), namely caching latent features during inference and masking frames during training. Our oVDA method outperforms all competing online video depth estimation methods in both accuracy and VRAM usage. Low VRAM usage is particularly important for deployment on edge devices. We demonstrate that oVDA runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device. We will release both code and compilation scripts, making oVDA easy to deploy on low-power hardware.
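The latent-feature caching idea, borrowed from LLM KV-caching, can be sketched in a few lines. The abstract does not give oVDA's actual architecture, so `LatentCache`, `process_stream`, and the toy `encoder`/`temporal_head` below are illustrative stand-ins: each incoming frame is encoded once, its latent is pushed into a fixed-size FIFO cache, and the depth head only ever attends to cached past features, which is what makes the inference unidirectional and keeps memory bounded.

```python
from collections import deque

import numpy as np

class LatentCache:
    """Fixed-size FIFO cache of per-frame latent features (KV-cache style)."""

    def __init__(self, max_frames: int):
        self.buffer = deque(maxlen=max_frames)  # old latents evicted automatically

    def append(self, latent: np.ndarray) -> None:
        self.buffer.append(latent)

    def context(self) -> np.ndarray:
        # Stack cached latents along a time axis: (T, ...) with T <= max_frames.
        return np.stack(list(self.buffer), axis=0)

def process_stream(frames, encoder, temporal_head, cache_size=4):
    """Unidirectional streaming inference: each frame sees only past latents."""
    cache = LatentCache(cache_size)
    depths = []
    for frame in frames:
        cache.append(encoder(frame))              # encode each frame exactly once
        depths.append(temporal_head(cache.context()))
    return depths

# Toy stand-ins for the real modules (hypothetical).
encoder = lambda f: f.mean(axis=-1)          # (H, W, C) -> (H, W) feature map
temporal_head = lambda ctx: ctx.mean(axis=0)  # fuse cached latents into a depth map

frames = [np.random.rand(8, 8, 3) for _ in range(10)]
depths = process_stream(frames, encoder, temporal_head)
print(len(depths), depths[0].shape)
```

Because the cache holds at most `cache_size` latents regardless of video length, peak memory stays constant for arbitrarily long streams, unlike batch processing over the full clip.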