🤖 AI Summary
Monocular endoscopic pose estimation and 3D reconstruction in minimally invasive surgery remain challenging due to depth ambiguity, dynamic tissue deformation, and texture scarcity. To address these issues, this paper proposes a unified framework that jointly leverages scale awareness and temporal awareness. We introduce MAPIS-Depth to generate pseudo-metric depth estimates and WEMA-RTDL to jointly optimise camera rotation and translation. Temporal consistency is enhanced via LPIPS-adaptive fusion of optical-flow and depth cues. The framework integrates RAFT optical flow, L-BFGS-B optimisation, TSDF voxel fusion, and marching cubes surface reconstruction. Evaluated on the HEVD and SCARED benchmarks, our method achieves significant improvements in pose accuracy and dynamic tissue surface reconstruction quality, outperforming existing state-of-the-art approaches.
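The pseudo-metric depth idea can be illustrated with a minimal sketch: align a fast per-frame relative depth prediction to a metric anchor by fitting a scale and shift with L-BFGS-B, the optimiser named in the paper. The function `align_scale_shift`, its least-squares objective, and the anchor/mask inputs are illustrative assumptions, not the actual MAPIS-Depth procedure.

```python
import numpy as np
from scipy.optimize import minimize

def align_scale_shift(rel_depth, metric_anchor, mask=None):
    """Fit scale s and shift t so that s * rel_depth + t approximates a
    metric anchor depth map, using L-BFGS-B on a least-squares objective.
    This is a hypothetical stand-in for MAPIS-Depth's scale alignment."""
    if mask is None:
        mask = np.ones(rel_depth.shape, dtype=bool)
    d = rel_depth[mask].ravel()
    a = metric_anchor[mask].ravel()

    def loss(params):
        s, t = params
        r = s * d + t - a
        return float(np.mean(r * r))

    res = minimize(loss, x0=np.array([1.0, 0.0]), method="L-BFGS-B",
                   bounds=[(1e-6, None), (None, None)])
    s, t = res.x
    # Return the pseudo-metric depth and the fitted parameters.
    return s * rel_depth + t, (s, t)
```

In practice the anchor would come from a metric predictor such as Depth Pro on a keyframe, and the relative map from a per-frame model such as Depth Anything; here both are plain arrays.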
📝 Abstract
Accurate endoscope pose estimation and 3D tissue surface reconstruction significantly enhance monocular minimally invasive surgical procedures by enabling accurate navigation and improved spatial awareness. However, monocular endoscope pose estimation and tissue reconstruction face persistent challenges, including depth ambiguity, physiological tissue deformation, inconsistent endoscope motion, limited texture fidelity, and a restricted field of view. To overcome these limitations, a unified framework for monocular endoscopic tissue reconstruction that integrates scale-aware depth prediction with temporally constrained perceptual refinement is presented. This framework incorporates a novel MAPIS-Depth module, which leverages Depth Pro for robust initialisation and Depth Anything for efficient per-frame depth prediction, in conjunction with L-BFGS-B optimisation, to generate pseudo-metric depth estimates. These estimates are temporally refined by computing pixel correspondences with RAFT and adaptively blending flow-warped frames according to LPIPS perceptual similarity, thereby reducing artefacts arising from physiological tissue deformation and motion. To ensure accurate registration of the pseudo-RGBD frames synthesised by MAPIS-Depth, a novel WEMA-RTDL module is integrated, optimising both rotation and translation. Finally, truncated signed distance function-based volumetric fusion and marching cubes are applied to extract a comprehensive 3D surface mesh. Evaluations on HEVD and SCARED, with ablation and comparative analyses, demonstrate the framework's robustness and superiority over state-of-the-art methods.
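The LPIPS-adaptive temporal refinement step can be sketched as a perceptually weighted blend: the flow-warped previous frame is trusted more when its LPIPS distance to the current frame is small. The exponential distance-to-weight mapping and the scale parameter `tau` are illustrative assumptions; the LPIPS score itself would come from a learned perceptual network and the warp from RAFT correspondences, both treated here as precomputed inputs.

```python
import numpy as np

def lpips_adaptive_fuse(current, warped_prev, lpips_dist, tau=0.2):
    """Blend a flow-warped previous frame into the current frame with a
    weight that decays as LPIPS perceptual distance grows.

    `lpips_dist` is assumed precomputed by an LPIPS network; the
    exponential decay with scale `tau` is a hypothetical choice, not
    the paper's exact fusion rule."""
    alpha = float(np.exp(-lpips_dist / tau))  # similar frames -> heavy fusion
    return alpha * warped_prev + (1.0 - alpha) * current
```

With `lpips_dist = 0` the warped history is kept outright, while a large distance (e.g. at a deformation or motion discontinuity) falls back to the current frame, suppressing warping artefacts.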