SCOPE: Speech-guided COllaborative PErception Framework for Surgical Scene Segmentation

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing surgical scene segmentation methods rely heavily on annotated data and predefined categories, resulting in poor generalizability. While prompt-based vision foundation models (VFMs) enable zero-shot segmentation, their dependence on manual visual or textual prompts hinders intraoperative deployment. To address this, we propose a speech-guided collaborative perception framework that synergistically integrates large language models (LLMs) with open-vocabulary VFMs, enabling a hands-free, zero-shot, and dynamically adaptive surgical video understanding system. Our approach leverages spoken instructions to drive visual candidate generation and employs surgical instruments as natural pointing tools to extend contextual understanding—supporting ad-hoc segmentation, labeling, and tracking of both instruments and anatomical structures. We validate real-time performance and robustness on the Cataract1k dataset and a newly curated ex vivo skull-base dataset, and demonstrate dynamic responsiveness through live simulated surgical experiments.

📝 Abstract
Accurate segmentation and tracking of relevant elements of the surgical scene is crucial for context-aware intraoperative assistance and decision making. Current solutions remain tethered to domain-specific, supervised models that rely on labeled data and require additional domain-specific data to adapt to new surgical scenarios or to categories beyond their predefined label set. Recent advances in prompt-driven vision foundation models (VFMs) have enabled open-set, zero-shot segmentation across heterogeneous medical images. However, these models' dependence on manual visual or textual cues restricts their deployment in intraoperative surgical settings. We introduce a speech-guided collaborative perception (SCOPE) framework that integrates the reasoning capabilities of a large language model (LLM) with the perception capabilities of open-set VFMs to support on-the-fly segmentation, labeling, and tracking of surgical instruments and anatomy in intraoperative video streams. A key component of this framework is a collaborative perception agent, which generates top-ranked VFM segmentation candidates and incorporates intuitive speech feedback from clinicians to guide the segmentation of surgical instruments in a natural human-machine collaboration paradigm. The instruments themselves then serve as interactive pointers to label additional elements of the surgical scene. We evaluated the proposed framework on a subset of the publicly available Cataract1k dataset and an in-house ex-vivo skull-base dataset to demonstrate its potential for on-the-fly segmentation and tracking of the surgical scene. Furthermore, we demonstrate its dynamic capabilities through a live mock ex-vivo experiment. This human-AI collaboration paradigm showcases the potential of adaptable, hands-free, surgeon-centric tools for dynamic operating-room environments.
Problem

Research questions and friction points this paper is trying to address.

Enabling open-set surgical scene segmentation without manual visual cues
Integrating speech feedback with vision models for real-time instrument tracking
Overcoming domain-specific data dependency in intraoperative assistance systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates LLM reasoning with VFMs for segmentation
Uses clinician speech feedback to guide instrument segmentation
Employs instruments as pointers to label surgical elements
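The three contributions above form a single perception loop: spoken instruction → LLM parsing → VFM candidate masks → clinician speech feedback → selected mask. A minimal sketch of that loop, with all function names, data structures, and scores invented for illustration (the paper does not publish an API):

```python
from dataclasses import dataclass

@dataclass
class Mask:
    """Placeholder for a VFM segmentation candidate (label + confidence)."""
    label: str
    score: float

def parse_command(speech: str) -> str:
    """Stand-in for LLM reasoning: extract the target class from speech."""
    # e.g. "Segment the forceps" -> "forceps"
    return speech.lower().replace("segment the ", "").strip()

def propose_masks(frame, text_prompt: str) -> list[Mask]:
    """Stand-in for an open-vocabulary VFM returning ranked candidates."""
    return [Mask(text_prompt, s) for s in (0.9, 0.7, 0.4)]

def select_with_feedback(candidates: list[Mask], feedback: str) -> Mask:
    """Clinician speech feedback picks among the ranked candidates."""
    ordinal = {"first": 0, "second": 1, "third": 2}.get(feedback, 0)
    return candidates[ordinal]

def scope_step(frame, speech: str, feedback: str) -> Mask:
    """One pass of the collaborative perception loop."""
    target = parse_command(speech)
    candidates = propose_masks(frame, target)
    return select_with_feedback(candidates, feedback)

mask = scope_step(frame=None, speech="Segment the forceps", feedback="second")
print(mask.label, mask.score)  # forceps 0.7
```

Once an instrument mask is confirmed this way, its tip position could be read off each frame and used as the "pointer" that labels nearby anatomy, as described in the abstract; that geometric step is omitted here.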