Monocular absolute depth estimation from endoscopy via domain-invariant feature learning and latent consistency

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular absolute depth estimation from endoscopic videos is hindered by the scarcity of real-world depth annotations and the substantial domain gap between synthetic and real endoscopic images, which limits the effectiveness of existing unsupervised domain adaptation methods. To address this, we propose a latent-space feature alignment framework that jointly optimizes domain-invariant feature extraction and directional feature consistency via adversarial learning—without requiring explicit image translation. This approach effectively mitigates inter-domain discrepancies while preserving geometric structure. Evaluated on central airway endoscopy data, our method significantly improves both absolute and relative depth estimation accuracy. It demonstrates consistent performance gains across diverse backbone architectures (e.g., ResNet, EfficientNet) and pretraining regimes (e.g., ImageNet, self-supervised). The resulting robust depth perception capability advances autonomous navigation for medical robotics in minimally invasive procedures.
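The "directional feature consistency" term described above can be pictured as encouraging latent features of translated-synthetic and real frames to point in the same direction, regardless of their magnitude. A minimal cosine-based sketch of such a loss (an illustrative guess at the idea, not the paper's implementation; the function name and shapes are hypothetical):

```python
import numpy as np

def directional_consistency_loss(f_syn, f_real):
    # Hypothetical sketch: penalize angular (directional) disagreement between
    # latent features of translated-synthetic and real frames by maximizing
    # their cosine similarity. Inputs: (batch, dim) feature arrays.
    f_syn = f_syn / np.linalg.norm(f_syn, axis=1, keepdims=True)
    f_real = f_real / np.linalg.norm(f_real, axis=1, keepdims=True)
    # Cosine similarity per sample; loss is 1 minus the mean similarity
    cos = np.sum(f_syn * f_real, axis=1)
    return 1.0 - cos.mean()

# Identical feature directions give zero loss, even under rescaling
f = np.array([[1.0, 0.0], [0.0, 2.0]])
print(directional_consistency_loss(f, 3.0 * f))  # -> 0.0 (scale-invariant)
```

Because the features are normalized before comparison, the loss constrains only their direction, which is one way to align domains without forcing their magnitudes to match.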

📝 Abstract
Monocular depth estimation (MDE) is a critical task to guide autonomous medical robots. However, obtaining absolute (metric) depth from an endoscopy camera in surgical scenes is difficult, which limits supervised learning of depth on real endoscopic images. Current image-level unsupervised domain adaptation methods translate synthetic images with known depth maps into the style of real endoscopic frames and train depth networks on these translated images with their corresponding depth maps. However, a domain gap often remains between real and translated synthetic images. In this paper, we present a latent feature alignment method that improves absolute depth estimation by reducing this domain gap in the context of endoscopic videos of the central airway. Our method is agnostic to the image translation process and focuses on the depth estimation itself. Specifically, the depth network takes translated synthetic and real endoscopic frames as input and learns latent domain-invariant features via adversarial learning and directional feature consistency. The evaluation is conducted on endoscopic videos of central airway phantoms with manually aligned absolute depth maps. Compared to state-of-the-art MDE methods, our approach achieves superior performance on both absolute and relative depth metrics, and consistently improves results across various backbones and pretrained weights. Our code is available at https://github.com/MedICL-VU/MDE.
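The adversarial learning of domain-invariant features described in the abstract is commonly realized with a domain discriminator and a gradient-reversal update (in the spirit of DANN). A toy 1-D numpy sketch under that assumption (all variable names, constants, and the linear models are illustrative, not taken from the paper's repository):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Illustrative 1-D latent features: translated-synthetic frames (domain 0)
# vs. real endoscopic frames (domain 1), deliberately offset from each other.
f_syn = rng.normal(0.0, 1.0, 200)
f_real = rng.normal(2.0, 1.0, 200)
x = np.concatenate([f_syn, f_real])
y = np.concatenate([np.zeros(200), np.ones(200)])

# 1) Train a linear domain discriminator on the frozen features
#    by gradient descent on the binary cross-entropy loss.
w = b = 0.0
for _ in range(300):
    p = sigmoid(w * x + b)
    w -= 0.1 * np.mean((p - y) * x)
    b -= 0.1 * np.mean(p - y)

# 2) Gradient reversal: update the synthetic features to *increase* the
#    discriminator's loss, pulling them toward the real-domain statistics.
#    For label 0 the BCE gradient w.r.t. a feature is p * w; the reversed
#    step adds it instead of subtracting it.
p_syn = sigmoid(w * f_syn + b)
shift = 0.5 * np.mean(p_syn * w)  # reversed-gradient step on a global offset
f_adapted = f_syn + shift

print(shift > 0)  # the synthetic domain moves toward the real domain
```

In a full pipeline both players would be updated alternately until the discriminator cannot tell the domains apart; the sketch shows only one reversed step to keep the dynamics deterministic.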
Problem

Research questions and friction points this paper is trying to address.

Estimating absolute depth from monocular endoscopy images in surgical environments
Reducing domain gap between synthetic and real endoscopic video data
Improving metric depth prediction accuracy for medical robot guidance
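The absolute and relative depth metrics referred to above typically include root-mean-square error (RMSE) and absolute relative error (AbsRel), which are standard in MDE evaluation. A minimal sketch (the sample depth values are made up for illustration):

```python
import numpy as np

def abs_rel(pred, gt):
    # Absolute relative error: mean of |prediction - truth| / truth
    return np.mean(np.abs(pred - gt) / gt)

def rmse(pred, gt):
    # Root-mean-square error in the same units as the depth maps
    return np.sqrt(np.mean((pred - gt) ** 2))

gt = np.array([10.0, 20.0, 40.0])    # hypothetical ground-truth depths (mm)
pred = np.array([11.0, 18.0, 44.0])  # hypothetical predictions
print(abs_rel(pred, gt))  # -> 0.1 (each prediction is off by 10%)
print(rmse(pred, gt))
```

AbsRel normalizes each error by the true depth, so it rewards relative accuracy, while RMSE penalizes large absolute errors; reporting both captures the paper's distinction between relative and metric depth quality.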
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-invariant feature learning via adversarial training
Latent consistency for depth estimation in endoscopy
Feature alignment agnostic to image translation process
Hao Li
Vanderbilt University
Daiwei Lu
Vanderbilt University
Jesse F. d'Almeida
Vanderbilt University
Dilara Isik
Vanderbilt University
Ehsan Khodapanah Aghdam
Unknown affiliation
Nick DiSanto
Vanderbilt University
Ayberk Acar
Computer Science Ph.D. Student, Vanderbilt University
Medical Imaging, Surgical Robotics, Extended Reality, Computer Vision
Susheela Sharma
Vanderbilt University
Jie Ying Wu
Assistant Professor in CS, Vanderbilt University
Medical Robotics, Modelling and Simulation, Machine Learning, Telerobotics
Robert J. Webster
Vanderbilt University
I. Oguz
Vanderbilt University