AI Summary
In surgical scene semantic segmentation, existing methods often neglect anatomical structures and struggle to jointly model low- and high-level features. To address these challenges, we propose the Feature-Adaptive Spatial Localization model (FASL-Seg), which employs parallel low-level edge and high-level contextual streams. By integrating multi-scale feature fusion with spatial attention mechanisms, FASL-Seg achieves hierarchical feature alignment and precise spatial localization, enabling simultaneous pixel-wise segmentation of both anatomical structures and surgical instruments within a unified framework. Evaluated on the EndoVis17 and EndoVis18 datasets, our method achieves overall mIoU scores of 72.78% and 72.71%, respectively, and an instrument-specific mIoU of 85.61%, substantially outperforming current state-of-the-art approaches. These results demonstrate FASL-Seg's superior robustness and fine-grained semantic parsing capability, providing a stronger foundation for vision-based understanding and intelligent assistance in minimally invasive surgery.
Abstract
The growing popularity of robotic minimally invasive surgery has made deep learning-based surgical training a key area of research. A thorough understanding of the surgical scene components is crucial, which semantic segmentation models can help achieve. However, most existing work focuses on surgical tools and overlooks anatomical objects. Additionally, current state-of-the-art (SOTA) models struggle to balance capturing high-level contextual features and low-level edge features. We propose a Feature-Adaptive Spatial Localization model (FASL-Seg), designed to capture features at multiple levels of detail through two distinct processing streams, namely a Low-Level Feature Projection (LLFP) stream and a High-Level Feature Projection (HLFP) stream, each handling a different feature resolution, enabling precise segmentation of anatomy and surgical instruments. We evaluated FASL-Seg on the surgical segmentation benchmark datasets EndoVis18 and EndoVis17 across three use cases. FASL-Seg achieves a mean Intersection over Union (mIoU) of 72.71% on parts and anatomy segmentation in EndoVis18, improving on SOTA by 5%. It further achieves mIoU scores of 85.61% and 72.78% on tool type segmentation in EndoVis18 and EndoVis17, respectively, outperforming SOTA overall performance, with per-class results comparable to SOTA on both datasets and consistent performance across anatomy and instrument classes, demonstrating the effectiveness of distinct processing streams for varying feature resolutions.
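The core idea of the two processing streams can be illustrated with a small sketch: a low-level stream kept at fine resolution for edges, a high-level stream at coarser resolution for context, each gated by a spatial attention map and then fused at the finer resolution. This is a toy NumPy illustration under stated assumptions, not the paper's actual LLFP/HLFP implementation; the pooling-based sigmoid gate and nearest-neighbour upsampling are stand-ins for the real modules.

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_attention(feat):
    """Gate a feature map (C, H, W) with a spatial attention map.

    Assumption: attention computed from channel-wise average and max
    pooling followed by a sigmoid, applied multiplicatively per pixel.
    """
    avg_pool = feat.mean(axis=0)                          # (H, W)
    max_pool = feat.max(axis=0)                           # (H, W)
    gate = 1.0 / (1.0 + np.exp(-(avg_pool + max_pool)))   # sigmoid
    return feat * gate                                    # broadcast over C

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling: (C, H, W) -> (C, 2H, 2W)."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def fuse_streams(low_feat, high_feat):
    """Align the coarse high-level stream to the low-level resolution,
    gate both streams, and fuse them by channel concatenation."""
    high_up = upsample2x(high_feat)                       # match spatial size
    return np.concatenate(
        [spatial_attention(low_feat), spatial_attention(high_up)], axis=0
    )

# Toy features: a 64x64 low-level edge map and a 32x32 context map.
low = rng.standard_normal((16, 64, 64))
high = rng.standard_normal((32, 32, 32))
fused = fuse_streams(low, high)
print(fused.shape)  # (48, 64, 64)
```

A segmentation head would then map the fused channels to per-class logits at each pixel; the point of the sketch is that the two resolutions are reconciled before fusion rather than forcing one backbone to serve both feature levels.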