VLM-Guided Iterative Refinement for Surgical Image Segmentation with Foundation Models

📅 2026-02-09

🤖 AI Summary

Existing surgical image segmentation methods are limited to predefined categories, produce one-shot predictions without adaptive refinement, and do not support natural language interaction. To address these limitations, the authors propose IR-SIS, an interactive framework that uses a fine-tuned SAM3 model to generate initial segmentations, a vision-language model to detect instruments and assess segmentation quality, and an agent-based workflow to select refinement strategies adaptively. IR-SIS also supports a clinician-in-the-loop mechanism that incorporates natural language feedback for iterative refinement. The authors additionally build a multi-granularity language-annotated dataset from the EndoVis2017 and EndoVis2018 benchmarks. They report state-of-the-art performance on both in-domain and out-of-distribution data, with clinician interaction providing further accuracy gains, and describe the work as the first language-based surgical segmentation framework with adaptive self-refinement.

📝 Abstract

Surgical image segmentation is essential for robot-assisted surgery and intraoperative guidance. However, existing methods are constrained to predefined categories, produce one-shot predictions without adaptive refinement, and lack mechanisms for clinician interaction. We propose IR-SIS, an iterative refinement system for surgical image segmentation that accepts natural language descriptions. IR-SIS leverages a fine-tuned SAM3 for initial segmentation, employs a Vision-Language Model to detect instruments and assess segmentation quality, and applies an agentic workflow that adaptively selects refinement strategies. The system supports clinician-in-the-loop interaction through natural language feedback. We also construct a multi-granularity language-annotated dataset from EndoVis2017 and EndoVis2018 benchmarks. Experiments demonstrate state-of-the-art performance on both in-domain and out-of-distribution data, with clinician interaction providing additional improvements. Our work establishes the first language-based surgical segmentation framework with adaptive self-refinement capabilities.
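The refinement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the callables `segment`, `assess`, and the `strategies` table, the `Assessment` fields, and the `threshold` and `max_rounds` parameters are all hypothetical stand-ins for the fine-tuned SAM3, the Vision-Language Model, and the agentic strategy selection. The toy helpers exist only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Mask = List[List[int]]  # toy binary mask; a real system would use arrays


@dataclass
class Assessment:
    score: float             # hypothetical quality score in [0, 1]
    strategy: Optional[str]  # name of a suggested refinement, or None


def refine_iteratively(
    image,
    prompt: str,
    segment: Callable,             # stand-in for the segmentation model
    assess: Callable,              # stand-in for the VLM quality check
    strategies: Dict[str, Callable],
    threshold: float = 0.9,        # assumed stopping criterion
    max_rounds: int = 3,           # assumed iteration budget
    feedback: Optional[str] = None,  # optional clinician feedback text
) -> Tuple[Mask, List[float]]:
    """Segment once, then alternate assessment and refinement."""
    mask = segment(image, prompt)
    history: List[float] = []
    for _ in range(max_rounds):
        result = assess(image, mask, prompt, feedback)
        history.append(result.score)
        # Stop when the mask is judged good enough or nothing is suggested.
        if result.score >= threshold or result.strategy is None:
            break
        # Apply the refinement strategy chosen by the assessor.
        mask = strategies[result.strategy](image, mask, prompt)
    # Note: if the budget runs out, the last refinement is not re-assessed.
    return mask, history


# --- Toy stand-ins, for illustration only ---------------------------------
def toy_segment(image, prompt: str) -> Mask:
    return [[0, 0], [0, 0]]


def toy_assess(image, mask: Mask, prompt: str, feedback) -> Assessment:
    cells = [v for row in mask for v in row]
    score = sum(cells) / len(cells)
    return Assessment(score=score, strategy="grow" if score < 1 else None)


def toy_grow(image, mask: Mask, prompt: str) -> Mask:
    new_mask = [row[:] for row in mask]
    for row in new_mask:
        for j, value in enumerate(row):
            if value == 0:
                row[j] = 1  # fill the first empty cell in row-major order
                return new_mask
    return new_mask
```

Calling `refine_iteratively(None, "grasper", toy_segment, toy_assess, {"grow": toy_grow}, threshold=0.75, max_rounds=5)` runs the loop on the toy stand-ins. In this sketch, clinician feedback is simply forwarded to the assessor as text; how IR-SIS actually incorporates it is not specified in the abstract.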

Problem

Research questions and friction points this paper is trying to address.

surgical image segmentation

clinician interaction

adaptive refinement

foundation models

natural language feedback

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model

Iterative Refinement

Surgical Image Segmentation