🤖 AI Summary
Endoscopic submucosal dissection (ESD) poses challenges for surgical phase recognition because its phases look highly similar and RGB images provide few structural cues. Method: This work introduces depth maps as an auxiliary modality, proposing a depth-guided geometric prior generation module and a geometry-enhanced multi-scale cross-attention mechanism to enable structure-aware representation learning. Built upon a re-parameterizable RepVGG backbone, the method fuses RGB and depth modalities to explicitly encode anatomical geometric constraints. Results: Evaluated on a custom nine-phase ESD dataset, the approach achieves state-of-the-art performance, improving robustness and generalization while keeping computational overhead low enough for real-time clinical assistance. Contribution: This is the first study to incorporate depth information into minimally invasive surgical phase recognition and to establish an end-to-end geometry-aware recognition framework.
📝 Abstract
Surgical phase recognition plays a critical role in developing intelligent assistance systems for minimally invasive procedures such as Endoscopic Submucosal Dissection (ESD). However, the high visual similarity across phases and the lack of structural cues in RGB images pose significant challenges. Depth information offers geometric cues that complement appearance features by exposing spatial relationships and anatomical structures. In this paper, we pioneer the use of depth information for surgical phase recognition and propose Geo-RepNet, a geometry-aware convolutional framework that integrates RGB images and depth information to enhance recognition performance in complex surgical scenes. Built upon a re-parameterizable RepVGG backbone, Geo-RepNet incorporates a Depth-Guided Geometric Prior Generation (DGPG) module, which extracts geometry priors from raw depth maps, and a Geometry-Enhanced Multi-scale Attention (GEMA) module, which injects spatial guidance through geometry-aware cross-attention and efficient multi-scale aggregation. To evaluate the effectiveness of our approach, we construct a nine-phase ESD dataset with dense frame-level annotations from real-world ESD videos. Extensive experiments on the proposed dataset demonstrate that Geo-RepNet achieves state-of-the-art performance while remaining robust and computationally efficient in complex, low-texture surgical environments.
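To illustrate why a re-parameterizable backbone helps with efficiency, the following is a minimal sketch of RepVGG-style branch merging, not code from Geo-RepNet. A RepVGG training block sums a 3x3 convolution, a 1x1 convolution, and an identity branch; because convolution is linear, the three can be folded into a single 3x3 kernel for inference. The sketch assumes equal input and output channels and stride 1, and it omits the batch-normalization fusion that the full technique also performs. All function and variable names are hypothetical.

```python
import random

# Sketch of RepVGG-style structural re-parameterization.
# Assumptions: equal in/out channels, stride 1, BatchNorm fusion omitted.
# Names are illustrative and not taken from Geo-RepNet.

def conv2d_same(x, w):
    """Cross-correlate x[c_in][h][w] with w[c_out][c_in][k][k] (zero padding, stride 1)."""
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    k = len(w[0][0])
    pad = k // 2
    out = []
    for kernel in w:
        plane = [[0.0] * width for _ in range(height)]
        for i in range(height):
            for j in range(width):
                acc = 0.0
                for c in range(c_in):
                    for di in range(k):
                        for dj in range(k):
                            ii, jj = i + di - pad, j + dj - pad
                            if 0 <= ii < height and 0 <= jj < width:
                                acc += kernel[c][di][dj] * x[c][ii][jj]
                plane[i][j] = acc
        out.append(plane)
    return out

def add_maps(*maps):
    """Element-wise sum of feature maps that share the same shape."""
    return [[[sum(vals) for vals in zip(*rows)]
             for rows in zip(*planes)]
            for planes in zip(*maps)]

def merge_branches(w3, w1):
    """Fold the 1x1 branch and the identity branch into the centre of the 3x3 kernel."""
    merged = [[[row[:] for row in k] for k in kernel] for kernel in w3]
    for o in range(len(w3)):
        for c in range(len(w3[0])):
            merged[o][c][1][1] += w1[o][c][0][0]  # 1x1 conv acts at the kernel centre
            if o == c:
                merged[o][c][1][1] += 1.0         # identity maps channel c to itself
    return merged

random.seed(0)
C, H, W = 2, 4, 5
x = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
w3 = [[[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
       for _ in range(C)] for _ in range(C)]
w1 = [[[[random.uniform(-1, 1)]] for _ in range(C)] for _ in range(C)]

# Training-time form: three parallel branches summed.
train_out = add_maps(conv2d_same(x, w3), conv2d_same(x, w1), x)
# Inference-time form: one 3x3 convolution with the merged kernel.
deploy_out = conv2d_same(x, merge_branches(w3, w1))

max_diff = max(abs(a - b)
               for p, q in zip(train_out, deploy_out)
               for r, s in zip(p, q)
               for a, b in zip(r, s))
```

Here `max_diff` measures the largest gap between the multi-branch and merged outputs; it should be at the level of floating-point rounding, which is what lets the deployed network run as a plain stack of 3x3 convolutions.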