🤖 AI Summary
Monocular endoscopic pose estimation and 3D reconstruction in minimally invasive surgery remain challenging due to depth ambiguity, dynamic tissue deformation, and texture scarcity. To address these issues, this paper proposes a unified framework that jointly leverages scale awareness and temporal awareness. We introduce MAPIS-Depth to generate pseudo-metric depth estimates and WEMA-RTDL to jointly optimise camera rotation and translation. Temporal consistency is enhanced via LPIPS-adaptive fusion of optical-flow and depth cues. The framework integrates RAFT optical flow, L-BFGS-B optimisation, TSDF voxel fusion, and marching cubes surface reconstruction. Evaluated on the HEVD and SCARED benchmarks, our method achieves significant improvements in pose accuracy and dynamic tissue surface reconstruction quality, outperforming existing state-of-the-art approaches.
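The pseudo-metric depth idea can be illustrated with a minimal sketch: align a fast per-frame relative depth prediction to a metric anchor by fitting a scale and shift with L-BFGS-B, the optimiser named in the paper. The function `align_scale_shift`, its least-squares objective, and the anchor/mask inputs are illustrative assumptions, not the actual MAPIS-Depth procedure.

```python
import numpy as np
from scipy.optimize import minimize

def align_scale_shift(rel_depth, metric_anchor, mask=None):
    """Fit scale s and shift t so that s * rel_depth + t approximates a
    metric anchor depth map, using L-BFGS-B on a least-squares objective.
    This is a hypothetical stand-in for MAPIS-Depth's scale alignment."""
    if mask is None:
        mask = np.ones(rel_depth.shape, dtype=bool)
    d = rel_depth[mask].ravel()
    a = metric_anchor[mask].ravel()

    def loss(params):
        s, t = params
        r = s * d + t - a
        return float(np.mean(r * r))

    res = minimize(loss, x0=np.array([1.0, 0.0]), method="L-BFGS-B",
                   bounds=[(1e-6, None), (None, None)])
    s, t = res.x
    # Return the pseudo-metric depth and the fitted parameters.
    return s * rel_depth + t, (s, t)
```

In practice the anchor would come from a metric predictor such as Depth Pro on a keyframe, and the relative map from a per-frame model such as Depth Anything; here both are plain arrays.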
📝 Abstract
Accurate endoscope pose estimation and 3D tissue surface reconstruction significantly enhance monocular minimally invasive surgical procedures by enabling accurate navigation and improved spatial awareness. However, monocular endoscope pose estimation and tissue reconstruction face persistent challenges, including depth ambiguity, physiological tissue deformation, inconsistent endoscope motion, limited texture fidelity, and a restricted field of view. To overcome these limitations, a unified framework for monocular endoscopic tissue reconstruction that integrates scale-aware depth prediction with temporally constrained perceptual refinement is presented. This framework incorporates a novel MAPIS-Depth module, which leverages Depth Pro for robust initialisation and Depth Anything for efficient per-frame depth prediction, in conjunction with L-BFGS-B optimisation, to generate pseudo-metric depth estimates. These estimates are temporally refined by computing pixel correspondences with RAFT and adaptively blending flow-warped frames according to LPIPS perceptual similarity, thereby reducing artefacts arising from physiological tissue deformation and motion. To ensure accurate registration of the pseudo-RGBD frames synthesised by MAPIS-Depth, a novel WEMA-RTDL module is integrated, optimising both rotation and translation. Finally, truncated signed distance function-based volumetric fusion and marching cubes are applied to extract a comprehensive 3D surface mesh. Evaluations on HEVD and SCARED, with ablation and comparative analyses, demonstrate the framework's robustness and superiority over state-of-the-art methods.
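The LPIPS-adaptive temporal refinement step can be sketched as a perceptually weighted blend: the flow-warped previous frame is trusted more when its LPIPS distance to the current frame is small. The exponential distance-to-weight mapping and the scale parameter `tau` are illustrative assumptions; the LPIPS score itself would come from a learned perceptual network and the warp from RAFT correspondences, both treated here as precomputed inputs.

```python
import numpy as np

def lpips_adaptive_fuse(current, warped_prev, lpips_dist, tau=0.2):
    """Blend a flow-warped previous frame into the current frame with a
    weight that decays as LPIPS perceptual distance grows.

    `lpips_dist` is assumed precomputed by an LPIPS network; the
    exponential decay with scale `tau` is a hypothetical choice, not
    the paper's exact fusion rule."""
    alpha = float(np.exp(-lpips_dist / tau))  # similar frames -> heavy fusion
    return alpha * warped_prev + (1.0 - alpha) * current
```

With `lpips_dist = 0` the warped history is kept outright, while a large distance (e.g. at a deformation or motion discontinuity) falls back to the current frame, suppressing warping artefacts.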