🤖 AI Summary
Image-guided pituitary surgery demands an intraoperative AI co-pilot capable of dynamic interaction and task planning, yet existing static models do not support multimodal, real-time decision-making in this complex neurosurgical context. Method: We introduce SurgicalVLM-Agent, a vision-language model (VLM)-based co-pilot for pituitary surgery, together with PitAgent, a surgical context-aware dataset for structured task planning, and FFT-GaLore, a fast Fourier transform (FFT)-based gradient projection technique for efficient low-rank fine-tuning of LLaMA 3.2. The end-to-end system orchestrates anatomical segmentation, overlay of preoperative imaging onto intraoperative views, surgical instrument tracking, and surgical visual question answering (VQA) in response to surgeon queries. Contribution/Results: Experiments on PitAgent demonstrate state-of-the-art performance in surgical task planning and prompt generation, and zero-shot surgical VQA on a public pituitary dataset yields highly semantically meaningful responses, supporting the agent's use as interactive, interpretable intraoperative decision support.
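To make "structured task planning" concrete, the following is a minimal sketch of the kind of plan such an agent could emit for a surgeon query. The schema, task labels, and prompts are illustrative assumptions, not the paper's actual output format.

```python
# Hypothetical structured task plan an agent like SurgicalVLM-Agent might produce;
# the schema and task names below are illustrative assumptions only.
surgeon_query = "Show the tumor boundary on the MRI and track the suction tip."

task_plan = [
    {
        "task": "mri_tumor_segmentation",   # assumed task label
        "prompt": "Segment the pituitary adenoma on the preoperative MRI.",
    },
    {
        "task": "instrument_tracking",      # assumed task label
        "prompt": "Track the suction instrument tip in the endoscopic view.",
    },
]

# Each planned task is dispatched to its downstream model with the generated prompt.
for step in task_plan:
    print(f"{step['task']}: {step['prompt']}")
```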
📝 Abstract
Image-guided surgery demands adaptive, real-time decision support, yet static AI models struggle with structured task planning and providing interactive guidance. Large vision-language models (VLMs) offer a promising solution by enabling dynamic task planning and predictive decision support. We introduce SurgicalVLM-Agent, an AI co-pilot for image-guided pituitary surgery, capable of conversation, planning, and task execution. The agent dynamically processes surgeon queries and plans tasks such as MRI tumor segmentation, endoscope anatomy segmentation, overlaying preoperative imaging onto intraoperative views, instrument tracking, and surgical visual question answering (VQA). To enable structured task planning, we develop the PitAgent dataset, a surgical context-aware dataset covering segmentation, overlaying, instrument localization, tool tracking, tool-tissue interactions, phase identification, and surgical activity recognition. Additionally, we propose FFT-GaLore, a fast Fourier transform (FFT)-based gradient projection technique for efficient low-rank adaptation, which optimizes fine-tuning of LLaMA 3.2 for surgical environments. We validate SurgicalVLM-Agent by assessing task planning and prompt generation on our PitAgent dataset and evaluating zero-shot VQA using a public pituitary dataset. Results demonstrate state-of-the-art performance in task planning and query interpretation, with highly semantically meaningful VQA responses, advancing AI-driven surgical assistance.
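The abstract does not spell out FFT-GaLore's mechanics, so the PyTorch sketch below shows one plausible reading: replace GaLore-style SVD-based gradient projection with a truncated FFT along one gradient dimension, so optimizer state lives in a compact frequency-domain subspace. The function names, truncation scheme, and the `adam_step` placeholder are assumptions for illustration, not the authors' implementation.

```python
import torch

def fft_lowrank_project(grad: torch.Tensor, rank: int) -> torch.Tensor:
    """Project a full gradient into a compact frequency-domain subspace.

    Assumption: an FFT-GaLore-style projection keeps only the `rank`
    lowest-frequency components of the gradient along its row dimension.
    """
    spectrum = torch.fft.rfft(grad, dim=0)  # (m, n) -> (m//2 + 1, n), complex
    return spectrum[:rank]                  # low-rank optimizer state lives here


def fft_lowrank_unproject(compact: torch.Tensor, out_rows: int) -> torch.Tensor:
    """Map the compact frequency-domain update back to the full parameter shape."""
    full = torch.zeros(out_rows // 2 + 1, compact.shape[1], dtype=compact.dtype)
    full[: compact.shape[0]] = compact
    return torch.fft.irfft(full, n=out_rows, dim=0)


# Sketch of one training step (`adam_step` is a hypothetical low-rank optimizer):
#   g       = param.grad                         # full gradient, shape (m, n)
#   g_low   = fft_lowrank_project(g, rank=8)     # compressed gradient
#   upd_low = adam_step(g_low)                   # optimizer runs in low-rank space
#   param.data -= lr * fft_lowrank_unproject(upd_low, g.shape[0])
```

The appeal of an FFT-based projection over SVD, under this reading, is that it avoids repeatedly recomputing a decomposition of the gradient while still constraining optimizer memory to the low-rank subspace.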