GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation

📅 2026-03-01
🤖 AI Summary
Existing surgical instrument benchmarks support only category-level segmentation, which is insufficient for clinical applications requiring precise localization of specific tool instances based on function, spatial relationships, or anatomical interactions. This work proposes the first language-guided benchmark for instance-level surgical tool localization, introducing a novel language-conditioned instance segmentation task that spans multiple surgical procedures, imaging modalities, and complex operative scenarios. By pairing natural language descriptions with images and incorporating both bounding box and point-level anchor annotations, the benchmark jointly evaluates vision–language models’ capabilities in referential grounding and pixel-level localization within multi-instrument settings. Experiments reveal that current state-of-the-art models perform poorly on this task, underscoring the urgent need for surgical AI systems to develop robust visual–language reasoning grounded in clinical context.

📝 Abstract
Clinically reliable perception of surgical scenes is essential for advancing intelligent, context-aware intraoperative assistance such as instrument handoff guidance, collision avoidance, and workflow-aware robotic support. Existing surgical tool benchmarks primarily evaluate category-level segmentation, requiring models to detect all instances of predefined instrument classes. However, real-world clinical decisions often require resolving references to a specific instrument instance based on its functional role, spatial relation, or anatomical interaction, properties not captured by current evaluation paradigms. We introduce GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark. Each instance pairs a surgical image with a natural-language description targeting a single instrument, accompanied by structured spatial grounding annotations including bounding boxes and point-level anchors. The dataset spans ophthalmic, laparoscopic, robotic, and open procedures, encompassing diverse instrument types, imaging conditions, and operative complexities. By jointly evaluating linguistic reference resolution and pixel-level localization, GroundedSurg enables systematic evaluation of vision-language models in clinically realistic multi-instrument scenes. Extensive experiments demonstrate substantial performance gaps across modern segmentation models and vision-language models (VLMs), highlighting the urgent need for clinically grounded vision-language reasoning in surgical AI systems. Code and data are publicly available at https://github.com/gaash-lab/GroundedSurg.
Problem

Research questions and friction points this paper is trying to address.

surgical tool segmentation
language-conditioned grounding
instance-level perception
vision-language models
clinical scene understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

language-conditioned segmentation
surgical tool grounding
instance-level annotation
vision-language models
surgical scene understanding
Tajamul Ashraf
IIT Delhi, MBZUAI
Computer Vision · Deep Learning
Abrar Ul Riyaz
Gaash Research Lab, National Institute of Technology Srinagar, India
Wasif Tak
Thapar Institute of Engineering and Technology, India
Tavaheed Tariq
Gaash Research Lab, National Institute of Technology Srinagar, India
Sonia Yadav
Gaash Research Lab, National Institute of Technology Srinagar, India
Moloud Abdar
Senior Data Scientist, The University of Queensland, Australia
Machine Learning · Deep Learning · Computer Vision · Vision-Language Models · Sentiment Analysis
Janibul Bashir
Gaash Research Lab, National Institute of Technology Srinagar, India