🤖 AI Summary
Current AI-powered surgical assistance systems suffer from rigid task definitions, fixed category priors, and dependence on dense, labor-intensive annotations, hindering dynamic, natural intraoperative human–machine interaction. To address this, we propose a memory-augmented multimodal Perception Agent that integrates speech-driven large language model (LLM) prompting, the Segment Anything Model (SAM), and an any-point tracking module, enabling intuitive, zero-shot segmentation of unseen surgical objects and cross-scenario generalization. The agent operates without explicit manual prompts or predefined semantic categories, supporting real-time, adaptive human–robot collaboration. On a public benchmark, its segmentation accuracy is on par with far more labor-intensive manual-prompting strategies; on a custom-curated dataset, it generalizes to previously unseen surgical instruments, phantom grafts, and gauze. These results demonstrate strong zero-shot generalization and clinical translatability, advancing the paradigm of symbiotic human–machine surgery.
📝 Abstract
Emerging surgical data science and robotics solutions, especially those designed to provide assistance in situ, require natural human–machine interfaces to fully unlock their potential for adaptive and intuitive aid. Contemporary AI-driven solutions remain inherently rigid, offering limited flexibility and restricting natural human–machine interaction in dynamic surgical environments: they rely heavily on extensive task-specific pre-training, fixed object categories, and explicit manual prompting. This work introduces a novel Perception Agent that leverages speech-integrated, prompt-engineered large language models (LLMs), the Segment Anything Model (SAM), and any-point tracking foundation models to enable more natural human–machine interaction in real-time intraoperative surgical assistance. Incorporating a memory repository and two novel mechanisms for segmenting unseen elements, the Perception Agent offers the flexibility to segment both known and unseen elements in the surgical scene through intuitive interaction. By memorizing novel elements for use in future surgeries, this work takes a marked step towards human–machine symbiosis in surgical procedures. Through quantitative analysis on a public dataset, we show that our agent performs on par with considerably more labor-intensive manual-prompting strategies. Qualitatively, we demonstrate its flexibility in segmenting novel elements (instruments, phantom grafts, and gauze) in a custom-curated dataset. By offering natural human–machine interaction and overcoming this rigidity, our Perception Agent potentially brings AI-based real-time assistance in dynamic surgical environments closer to reality.
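The interaction loop the abstract describes (speech command → LLM parsing → SAM-style segmentation, backed by a memory repository that retains novel elements for future procedures) can be sketched roughly as follows. All class and function names here are illustrative assumptions on our part, not the authors' implementation; the real system would call a speech-to-text model, a prompted LLM, SAM, and a point tracker where the stubs below stand in.

```python
from dataclasses import dataclass, field

@dataclass
class PerceptionAgent:
    """Minimal sketch of the agent loop (hypothetical names, not the paper's code)."""
    # Memory repository: element name -> stored prompt/representation,
    # so elements memorized in one surgery are reusable in the next.
    memory: dict = field(default_factory=dict)

    def parse_command(self, transcript: str) -> str:
        # Stand-in for the speech-integrated, prompt-engineered LLM:
        # extract the object the surgeon names, e.g. "segment the gauze" -> "gauze".
        return transcript.lower().replace("segment the", "").strip()

    def segment(self, transcript: str, frame) -> dict:
        target = self.parse_command(transcript)
        if target in self.memory:
            source = "memory"      # known element: reuse the stored prompt
        else:
            source = "zero-shot"   # unseen element: derive a fresh prompt, then memorize it
            self.memory[target] = {"prompt": f"auto-point:{target}"}
        # A real system would prompt SAM with the derived point here and hand
        # the resulting mask to an any-point tracker for subsequent frames.
        return {"target": target, "source": source}
```

Usage: the first request for an unseen element runs the zero-shot path and stores it; a later request for the same element is served from memory, which is the cross-session reuse the abstract emphasizes.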