🤖 AI Summary
Existing surgical scene segmentation methods rely heavily on annotated data and predefined categories, resulting in poor generalizability. While prompt-based vision foundation models (VFMs) enable zero-shot segmentation, their dependence on manual visual or textual prompts hinders intraoperative deployment. To address this, we propose a speech-guided collaborative perception framework that synergistically integrates large language models (LLMs) with open-vocabulary VFMs, enabling a hands-free, zero-shot, and dynamically adaptive surgical video understanding system. Our approach leverages spoken instructions to drive visual candidate generation and employs surgical instruments as natural pointing tools to extend contextual understanding—supporting ad-hoc segmentation, labeling, and tracking of both instruments and anatomical structures. We validate real-time performance and robustness on the Cataract1k dataset and a newly curated ex vivo skull-base dataset, and demonstrate dynamic responsiveness through live simulated surgical experiments.
📝 Abstract
Accurate segmentation and tracking of relevant elements of the surgical scene are crucial for enabling context-aware intraoperative assistance and decision making. Current solutions remain tethered to supervised, domain-specific models that rely on labeled data and require additional annotations to adapt to new surgical scenarios or to categories beyond a predefined label set. Recent advances in prompt-driven vision foundation models (VFMs) have enabled open-set, zero-shot segmentation across heterogeneous medical images. However, the dependence of these models on manual visual or textual cues restricts their deployment in intraoperative surgical settings. We introduce a speech-guided collaborative perception (SCOPE) framework that integrates the reasoning capabilities of large language models (LLMs) with the perception capabilities of open-set VFMs to support on-the-fly segmentation, labeling, and tracking of surgical instruments and anatomy in intraoperative video streams. A key component of this framework is a collaborative perception agent, which ranks the top candidates among VFM-generated segmentations and incorporates intuitive speech feedback from clinicians to guide the segmentation of surgical instruments in a natural human-machine collaboration paradigm. The instruments themselves then serve as interactive pointers to label additional elements of the surgical scene. We evaluate the proposed framework on a subset of the publicly available Cataract1k dataset and an in-house ex-vivo skull-base dataset to demonstrate its potential for on-the-fly segmentation and tracking of the surgical scene. Furthermore, we demonstrate its dynamic capabilities through a live mock ex-vivo experiment. This human-AI collaboration paradigm showcases the potential of developing adaptable, hands-free, surgeon-centric tools for dynamic operating-room environments.
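To make the described pipeline more concrete, the minimal sketch below illustrates one plausible reading of the speech-guided perception loop: a spoken instruction is transcribed, an LLM turns it into a text prompt for the open-set VFM, the VFM returns top mask candidates, and clinician speech feedback confirms the selection. All function names, the candidate data structure, and the control flow are hypothetical placeholders for illustration only; they are not the authors' implementation or API.

```python
# Illustrative sketch of a speech-guided collaborative perception loop.
# Every component here (transcribe_speech, parse_instruction, propose_masks,
# confirm_by_speech) is a hypothetical stub, not the SCOPE codebase.

from dataclasses import dataclass
from typing import List


@dataclass
class MaskCandidate:
    mask_id: int
    label: str    # open-vocabulary label proposed by the VFM
    score: float  # confidence used to rank the top candidates


def transcribe_speech(audio_chunk: bytes) -> str:
    """Placeholder: speech-to-text for the surgeon's spoken instruction."""
    return "segment the forceps"


def parse_instruction(utterance: str) -> str:
    """Placeholder: LLM reasoning step that turns the utterance into a
    text prompt (e.g. a target category) for the open-set VFM."""
    return utterance.replace("segment the ", "")


def propose_masks(frame, text_prompt: str, top_k: int = 3) -> List[MaskCandidate]:
    """Placeholder: open-vocabulary VFM returning top-k mask candidates."""
    return [MaskCandidate(i, text_prompt, 1.0 - 0.1 * i) for i in range(top_k)]


def confirm_by_speech(candidates: List[MaskCandidate]) -> MaskCandidate:
    """Placeholder: clinician speech feedback selects or corrects a candidate."""
    return candidates[0]


def perception_loop(frame, audio_chunk: bytes) -> MaskCandidate:
    """One iteration: speech -> LLM prompt -> VFM candidates -> confirmation.
    The confirmed instrument mask could then be tracked across frames and
    used as a pointer to label further anatomy, as the abstract describes."""
    prompt = parse_instruction(transcribe_speech(audio_chunk))
    candidates = propose_masks(frame, prompt)
    return confirm_by_speech(candidates)


if __name__ == "__main__":
    selected = perception_loop(frame=None, audio_chunk=b"")
    print(f"tracking '{selected.label}' (candidate {selected.mask_id})")
```

In practice each stub would be backed by real models (speech recognition, an LLM, and a promptable segmentation VFM with a tracker); the sketch only shows how the hands-free interaction could be orchestrated.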