🤖 AI Summary
Current surgical visual question answering (Surgical-VQA) models suffer from limited capacity in modeling long-range dependencies and achieving precise multimodal alignment, hindering structured answer generation and accurate anatomical region localization, both critical bottlenecks for clinical deployment in robotic surgery. To address these challenges, we propose a personalized large vision-language model (LVLM) tailored for complex intraoperative VQA and region localization. Our method introduces two key innovations: (1) VP-LoRA, a visual-perception adapter module that enhances fine-grained visual feature extraction; and (2) Token-Interaction (TIT), a cross-modal interaction mechanism enabling deep latent-space coordination between linguistic responses and visual localization. Both modules are integrated into a pre-trained LVLM framework alongside surgery-specific fine-tuning strategies. Extensive evaluation on EndoVis-17/18-VQLA and the newly curated EndoVis Conversations dataset demonstrates state-of-the-art performance, with significant improvements in both fine-grained anatomical localization accuracy and open-domain VQA capability.
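To make the VP-LoRA idea concrete, below is a minimal PyTorch sketch of a LoRA-style low-rank adapter wrapped around the linear layers of a frozen vision encoder. The class names, rank, and scaling follow the standard LoRA recipe and are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A and B are the LoRA factors."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weights frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.02)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def add_vision_lora(vision_encoder: nn.Module, r: int = 16) -> nn.Module:
    """Recursively replaces each Linear in the encoder with a LoRA wrapper,
    so only the low-rank factors are trained during surgical fine-tuning."""
    for name, module in vision_encoder.named_children():
        if isinstance(module, nn.Linear):
            setattr(vision_encoder, name, LoRALinear(module, r=r))
        else:
            add_vision_lora(module, r=r)
    return vision_encoder
```

Because the zero-initialized factor makes the adapter an identity at the start of training, the wrapped encoder initially reproduces the pre-trained features and only gradually specializes to surgical imagery.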
📝 Abstract
Recent advancements in Surgical Visual Question Answering (Surgical-VQA) and related region grounding have shown great promise for robotic and medical applications, addressing the critical need for automated methods in personalized surgical mentorship. However, existing models primarily provide simple structured answers and struggle with complex scenarios due to their limited capability in modeling long-range dependencies and aligning multimodal information. In this paper, we introduce Surgical-LVLM, a novel personalized large vision-language model (LVLM) tailored for complex surgical scenarios. Leveraging a pre-trained LVLM and specialized Visual Perception LoRA (VP-LoRA) blocks, our model excels at understanding complex visual-language tasks within surgical contexts. To address the visual grounding task, we propose the Token-Interaction (TIT) module, which strengthens the interaction between the grounding module and the language responses of the LVLM after projecting them into a shared latent space. We demonstrate the effectiveness of Surgical-LVLM on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly introduced EndoVis Conversations dataset, on which it sets new performance standards. Our work advances automated surgical mentorship by providing a context-aware solution.
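As a rough illustration of the Token-Interaction idea, the following hypothetical PyTorch sketch projects the LVLM's language-response tokens and the grounding module's visual tokens into a shared latent space, fuses them with cross-attention, and regresses a region box. All dimensions, module names, and the pooling and box-head design are assumptions made for the example, not details from the paper.

```python
import torch
import torch.nn as nn

class TokenInteraction(nn.Module):
    """Cross-modal interaction: visual (grounding) tokens attend to the
    LVLM's language-response tokens in a shared latent space, and the
    fused features drive a box-regression head."""

    def __init__(self, text_dim: int = 4096, vis_dim: int = 1024,
                 latent_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)  # language -> latent
        self.vis_proj = nn.Linear(vis_dim, latent_dim)    # visual -> latent
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)
        self.box_head = nn.Sequential(                    # (cx, cy, w, h) in [0, 1]
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, 4), nn.Sigmoid())

    def forward(self, text_tokens: torch.Tensor,
                vis_tokens: torch.Tensor) -> torch.Tensor:
        q = self.vis_proj(vis_tokens)      # queries: visual tokens
        kv = self.text_proj(text_tokens)   # keys/values: response tokens
        fused, _ = self.cross_attn(q, kv, kv)
        fused = self.norm(fused + q)       # residual connection
        return self.box_head(fused.mean(dim=1))  # pooled -> one box per image


# Usage with dummy shapes: batch of 2, 64 response tokens, 256 visual tokens
tit = TokenInteraction()
boxes = tit(torch.randn(2, 64, 4096), torch.randn(2, 256, 1024))
print(boxes.shape)  # torch.Size([2, 4])
```

Using the visual tokens as queries lets the answer text steer which image regions are emphasized, which matches the stated goal of coordinating linguistic responses with visual localization in a common latent space.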