EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams

📅 2025-12-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Monocular depth estimation for endoscopic video streams has to balance sharp anatomical boundaries against temporal consistency between frames. EndoStreamDepth addresses this with an endoscopy-specific framework that combines single-frame and temporal cues while processing frames one at a time rather than in batches. The method has three parts: a single-frame depth network with an endoscopy-specific transformation; multi-level Mamba temporal modules that propagate inter-frame information to improve accuracy and stabilize predictions; and a hierarchical design with multi-scale supervision, in which complementary loss terms jointly improve local boundary sharpness and global geometric consistency. Evaluated on two publicly available colonoscopy depth estimation datasets, the approach substantially improves on state-of-the-art monocular depth estimation methods and produces depth maps with sharp, anatomically aligned boundaries, which matter for downstream tasks such as automation for robotic surgery. The source code is publicly available.

📝 Abstract

This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance, and it produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery. The code is publicly available at https://github.com/MedICL-VU/EndoStreamDepth
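The streaming idea in the abstract (one frame in, one depth map out, with state carried between frames) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `single_frame_depth` is a hypothetical placeholder for the depth network, and `StreamingTemporalState` is a toy linear recurrence standing in for a Mamba temporal module, not the selective state-space layer or the multi-level design used in the paper.

```python
import numpy as np


def single_frame_depth(frame):
    """Hypothetical placeholder for the single-frame depth network.

    It just averages the colour channels so the sketch stays runnable.
    """
    return frame.mean(axis=-1)


class StreamingTemporalState:
    """Toy linear recurrence (assumption): h_t = a * h_{t-1} + (1 - a) * x_t.

    It only shows how a hidden state can be carried from frame to frame
    so that frames are processed individually instead of in batches.
    """

    def __init__(self, decay=0.8):
        self.decay = decay
        self.h = None  # hidden state carried across frames

    def step(self, x):
        if self.h is None:
            # First frame: no history yet, so the state is the input itself.
            self.h = x.copy()
        else:
            self.h = self.decay * self.h + (1.0 - self.decay) * x
        return self.h


def stream_depth(frames, decay=0.8):
    """Yield one temporally smoothed depth map per incoming frame."""
    temporal = StreamingTemporalState(decay)
    for frame in frames:
        yield temporal.step(single_frame_depth(frame))
```

A caller would iterate over `stream_depth(video_frames)` and consume each depth map as it arrives; only the current frame and the hidden state need to stay in memory.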

Problem

Research questions and friction points this paper is trying to address.

Monocular depth estimation for endoscopic video streams

Ensuring temporal consistency across video frames

Producing sharp anatomical boundaries in depth maps

Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming per-frame depth estimation (no batched inputs) with temporally consistent predictions

Multi-level Mamba modules for inter-frame information propagation

Hierarchical design with multi-scale supervision for sharp boundaries
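As a rough illustration of multi-scale supervision with complementary loss terms, the sketch below combines a pixel-wise L1 term (global geometry) with a gradient-difference term (boundary sharpness) at several resolutions. The specific terms, the `edge_weight` value, and the average-pooling pyramid are assumptions chosen for the example. The summary above does not state which losses the paper uses, and the paper supervises outputs of a hierarchical network rather than a downsampled copy of one prediction.

```python
import numpy as np


def downsample2x(x):
    """Average-pool a 2D map by a factor of 2 (odd edges are cropped)."""
    h, w = x.shape
    h2, w2 = h - h % 2, w - w % 2
    return x[:h2, :w2].reshape(h2 // 2, 2, w2 // 2, 2).mean(axis=(1, 3))


def l1_loss(pred, gt):
    """Mean absolute depth error: penalises global geometric deviation."""
    return float(np.abs(pred - gt).mean())


def gradient_loss(pred, gt):
    """Mean absolute difference of spatial gradients: penalises blurred edges."""
    dx = np.abs(np.diff(pred, axis=1) - np.diff(gt, axis=1)).mean()
    dy = np.abs(np.diff(pred, axis=0) - np.diff(gt, axis=0)).mean()
    return float(dx + dy)


def multiscale_loss(pred, gt, num_scales=3, edge_weight=0.5):
    """Average the combined loss over a pyramid of num_scales resolutions.

    Inputs must be at least 2 ** num_scales pixels wide and tall so that
    every pyramid level still has gradients to compare.
    """
    total = 0.0
    for _ in range(num_scales):
        total += l1_loss(pred, gt) + edge_weight * gradient_loss(pred, gt)
        pred, gt = downsample2x(pred), downsample2x(gt)
    return total / num_scales
```

A constant depth offset is penalised only by the L1 term, while a blurred boundary with the correct mean depth is penalised mainly by the gradient term, which is why the two terms are complementary.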

🔎 Similar Papers

EndoPerfect: A Hybrid NeRF-Stereo Vision Approach Pioneering Monocular Depth Estimation and 3D Reconstruction in Endoscopy

📅 2024-10-05

📈 Citations: 0
