EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams

📅 2025-12-19
🤖 AI Summary
Monocular depth estimation for endoscopic video streams must deliver both sharp anatomical boundaries and temporally consistent predictions across frames. EndoStreamDepth addresses this by combining single-frame and temporal cues while processing frames one at a time rather than in batches. It comprises a single-frame depth network with an endoscopy-specific transformation; multi-level Mamba temporal modules that propagate inter-frame information to improve accuracy and stabilize predictions; and a hierarchical design with multi-scale supervision, in which complementary loss terms jointly improve local boundary sharpness and global geometric consistency. Evaluated on two publicly available colonoscopy depth estimation datasets, the method substantially improves on state-of-the-art monocular depth estimation methods and produces depth maps with sharp, anatomically aligned boundaries, which matter for downstream tasks such as automation for robotic surgery. The source code is publicly available.

📝 Abstract
This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance, and it produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery. The code is publicly available at https://github.com/MedICL-VU/EndoStreamDepth
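The streaming idea in the abstract (one frame in, one depth map out, with a temporal module carrying information forward) can be illustrated with a toy recurrence. This is a minimal sketch, not the authors' implementation: `temporal_step`, `stream_depth`, and the `decay` parameter are hypothetical names, the single-frame depth network is replaced by an identity placeholder, and the fixed linear recurrence is a deliberately simplified, non-selective stand-in for a Mamba block.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Frame = List[List[float]]  # H x W grid of per-pixel values


def temporal_step(feature: Frame, state: Optional[Frame],
                  decay: float = 0.8) -> Tuple[Frame, Frame]:
    """One step of h_t = decay * h_{t-1} + (1 - decay) * x_t.

    Assumption: a fixed scalar decay stands in for the learned,
    input-dependent dynamics of a real Mamba module.
    """
    if state is None:
        # First frame: no history yet, so the state is the feature itself.
        new_state = [row[:] for row in feature]
    else:
        new_state = [
            [decay * h + (1.0 - decay) * x for h, x in zip(h_row, x_row)]
            for h_row, x_row in zip(state, feature)
        ]
    # The output equals the state here; a real module would apply a readout.
    return new_state, new_state


def stream_depth(frames: Iterable[Frame]) -> Iterator[Frame]:
    """Yield one output per incoming frame, carrying state across frames."""
    state: Optional[Frame] = None
    for frame in frames:
        feature = frame  # placeholder for the single-frame depth network
        output, state = temporal_step(feature, state)
        yield output
```

Because only the previous state is kept, memory does not grow with the length of the video, which is the property that makes per-frame processing suitable for a live stream.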
Problem

Research questions and friction points this paper is trying to address.

Monocular depth estimation for endoscopic video streams
Ensuring temporal consistency across video frames
Producing sharp anatomical boundaries in depth maps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frame-by-frame monocular depth estimation with a temporal module, in place of batched inputs
Multi-level Mamba modules for inter-frame information propagation
Hierarchical design with multi-scale supervision for sharp boundaries
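Multi-scale supervision with complementary loss terms can be sketched as follows. The summary does not name the specific losses, so the choices here are assumptions made for illustration only: an L1 term stands in for global accuracy, a horizontal-gradient term stands in for boundary sharpness, and `multiscale_loss`, `downsample2`, and `grad_weight` are hypothetical names.

```python
from typing import List

Image = List[List[float]]


def downsample2(img: Image) -> Image:
    """Halve resolution by averaging 2x2 blocks (even height and width assumed)."""
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
         for c in range(0, len(img[0]), 2)]
        for r in range(0, len(img), 2)
    ]


def l1_loss(pred: Image, target: Image) -> float:
    """Mean absolute error; returns 0.0 for empty inputs."""
    n = len(pred) * (len(pred[0]) if pred else 0)
    if n == 0:
        return 0.0
    return sum(abs(p - t)
               for p_row, t_row in zip(pred, target)
               for p, t in zip(p_row, t_row)) / n


def horizontal_gradient(img: Image) -> Image:
    """Difference between horizontally adjacent pixels."""
    return [[row[c + 1] - row[c] for c in range(len(row) - 1)] for row in img]


def multiscale_loss(preds: List[Image], target: Image,
                    grad_weight: float = 0.5) -> float:
    """Sum L1 and gradient losses over predictions ordered finest to coarsest.

    Each prediction is assumed to be half the resolution of the previous one.
    """
    total = 0.0
    gt = target
    for level, pred in enumerate(preds):
        if level > 0:
            gt = downsample2(gt)  # match ground truth to this level's size
        edge_term = l1_loss(horizontal_gradient(pred), horizontal_gradient(gt))
        total += l1_loss(pred, gt) + grad_weight * edge_term
    return total
```

The gradient term penalizes a prediction that gets the average depth right but blurs an edge, which the plain L1 term alone would under-weight.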