🤖 AI Summary
Existing text-prompted surgical instrument segmentation methods assume all prompted categories are present, leading to spurious mask generation when corresponding instruments are absent—thus critically relying on unavailable prior existence knowledge. Method: This paper introduces “existence-agnostic text-prompted segmentation” (R-SIS), a novel task, and proposes the RoSIS framework: (i) a multimodal fusion module jointly with a selective gating module to model cross-modal semantics; and (ii) a two-stage iterative refinement strategy (“name → location”) for joint learning of existence discrimination and mask optimization. Results: On multiple surgical datasets, RoSIS achieves zero false negatives while significantly outperforming state-of-the-art methods—improving mIoU by up to 5.2%. It is the first method to autonomously identify present categories solely from multi-class textual prompts and deliver highly robust, high-accuracy segmentation without requiring existence priors.
📝 Abstract
Surgical instrument segmentation (SIS) is essential in computer-assisted surgeries, with deep learning methods improving accuracy in complex environments. Recently, text-promptable segmentation methods have been introduced, generating masks based on textual descriptions. However, they assume the text-described object is present and always generate an associated mask even when the object is absent. Existing methods address this by using prompts only for objects already known to exist in the scene, which relies on inaccessible information. To address this, we rethink text-promptable SIS and redefine it under robust conditions as Robust text-promptable SIS (R-SIS). Unlike previous approaches, R-SIS is a process that analyzes text prompts for all surgical instrument categories without relying on external knowledge, identifies the instruments present in the scene, and segments them accordingly. Building on this, we propose Robust Surgical Instrument Segmentation (RoSIS), an optimized framework combining visual and language features for promptable segmentation in the R-SIS setting. RoSIS employs an encoder-decoder architecture with a Multi-Modal Fusion Block (MMFB) and a Selective Gate Block (SGB) for balanced integration of vision and language features. Additionally, an iterative refinement strategy enhances segmentation masks through a two-step process: an initial pass with name-based prompts, followed by refinement with location prompts. Experiments across multiple datasets and settings show that RoSIS outperforms existing vision-based and promptable segmentation methods under robust conditions. By rethinking text-promptable SIS, our work establishes a fair and effective approach to surgical instrument segmentation.