Zero-shot Monocular Metric Depth for Endoscopic Images

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
The endoscopic monocular depth estimation community faces two critical challenges: scarcity of high-quality annotated data and absence of reliable clinical generalization benchmarks. To address these, we propose the first zero-shot metric depth evaluation framework designed for previously unseen real-world scenarios, alongside EndoSynth—a novel synthetic dataset featuring photorealistic surgical instrument images, ground-truth metric depth maps, and pixel-level segmentation masks. Leveraging a Transformer-based architecture, we introduce an efficient synthetic-to-real transfer strategy that enables model fine-tuning without requiring real-world depth annotations. Experimental results demonstrate that models fine-tuned on EndoSynth achieve substantial improvements in metric depth accuracy across diverse unseen endoscopic videos—reducing mean relative error by 32.7% on average. This work provides the first systematic validation of synthetic-data-driven approaches for clinical-grade depth perception, establishing both their effectiveness and generalizability in real surgical settings.
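The headline result above is stated as a reduction in mean relative error, a standard accuracy measure for metric depth (often called AbsRel). As a hedged illustration of what that metric computes, here is a minimal sketch; the function name and the exact masking of invalid pixels are assumptions, not taken from the paper.

```python
def abs_rel_error(pred, gt, eps=1e-6):
    """Mean absolute relative error (AbsRel): mean(|pred - gt| / gt).

    pred, gt: flat sequences of depth values in the same metric units
    (e.g. millimetres). Pixels with non-positive ground truth, which are
    typically invalid, are skipped (an assumed convention).
    """
    errs = [abs(p - g) / g for p, g in zip(pred, gt) if g > eps]
    return sum(errs) / len(errs)
```

For intuition: a model whose AbsRel drops from 0.15 to about 0.10 has reduced mean relative error by roughly a third, comparable in magnitude to the 32.7% average reduction reported above.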

📝 Abstract
Monocular relative and metric depth estimation has seen a tremendous boost in the last few years due to sharp advancements in foundation models, and in particular transformer-based networks. As applications begin to reach the domain of endoscopic images, there is still a lack of robust benchmarks and high-quality datasets in that area. This paper addresses these limitations by presenting a comprehensive benchmark of state-of-the-art (metric and relative) depth estimation models evaluated on real, unseen endoscopic images, providing critical insights into their generalisation and performance in clinical scenarios. Additionally, we introduce and publish a novel synthetic dataset (EndoSynth) of endoscopic surgical instruments paired with ground truth metric depth and segmentation masks, designed to bridge the gap between synthetic and real-world data. We demonstrate that fine-tuning depth foundation models using our synthetic dataset boosts accuracy on most unseen real data by a significant margin. By providing both a benchmark and a synthetic dataset, this work advances the field of depth estimation for endoscopic images and serves as an important resource for future research. Project page, EndoSynth dataset and trained weights are available at https://github.com/TouchSurgery/EndoSynth.
Problem

Research questions and friction points this paper is trying to address.

Lack of robust benchmarks for endoscopic depth estimation
Absence of high-quality datasets for endoscopic images
Need to bridge synthetic-real data gap in endoscopy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarked depth models on real endoscopic images
Introduced synthetic dataset with ground truth
Fine-tuned foundation models with synthetic data
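Benchmarking both relative and metric models on the same real images, as the paper does, typically requires aligning scale-ambiguous relative predictions to metric ground truth before scoring them. A common convention for this is median scaling; the sketch below illustrates that convention and is an assumption, not the paper's documented evaluation protocol.

```python
import statistics

def median_scale_align(pred, gt):
    """Align a relative (scale-ambiguous) depth prediction to metric ground
    truth by matching medians: scaled = pred * median(gt) / median(pred).

    A widely used evaluation convention for relative-depth models; the
    paper's exact alignment protocol may differ.
    """
    s = statistics.median(gt) / statistics.median(pred)
    return [p * s for p in pred]
```

After alignment, metric error measures such as mean relative error can be applied uniformly to both model families.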