AI Summary
This work addresses the limitations of current vision-language models in recognizing surgical instrument-tissue interactions: they struggle to capture temporal dynamics and lack fine-grained action semantics in visual-linguistic alignment. To overcome these challenges, the study introduces instrument trajectories as temporal conditioning signals that guide the generation of more precise visual semantic embeddings. It further combines prompt tuning with a verb-rephrasing strategy to strengthen the model's semantic understanding of interaction actions. Evaluated on the CholecT50 dataset, the proposed approach achieves significant improvements in both average precision (AP) and Top-K accuracy. Qualitative analysis via cosine-similarity visualizations further confirms improved alignment between visual and textual embeddings, demonstrating the method's effectiveness in capturing nuanced surgical action semantics.
Abstract
Recognizing instruments' interactions with tissues is essential for building context-aware AI assistants in robotic surgery. Vision-language models (VLMs) have opened a new avenue for surgical perception and achieve better generalization across a wide range of tasks than conventional task-specific deep learning approaches. However, their performance on instrument-tissue interaction recognition remains limited, largely due to two challenges: (1) many models do not effectively leverage temporal information, and (2) alignment between vision and text often misses fine-grained action details. To address these issues, we propose TrajPred, a framework that encodes instrument trajectories to incorporate temporal motion cues and, conditioned on these trajectories, introduces a predictor module that generates visual semantic embeddings better capturing fine-grained action details. We further incorporate prompt tuning and a verb-rephrasing technique to enable smooth adaptation to the instrument-tissue interaction recognition task. Extensive experiments on the public laparoscopic benchmark CholecT50 show that our method improves both Average Precision and Top-K accuracy. We also investigate whether visual embeddings of instrument-tissue interaction regions align better with the corresponding text by visualizing the cosine similarity between visual and textual embeddings; the visualizations indicate that the proposed method improves alignment between relevant visual and textual representations.
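As a rough illustration of the qualitative analysis described above, the cosine similarity between patch-level visual embeddings and a textual embedding can be computed as in the minimal sketch below. Random vectors stand in for real CLIP-style features here; the function names (`cosine_similarity`, `similarity_map`) and the toy data are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_map(visual_patches, text_embedding):
    # One score per visual patch embedding; high scores mark regions whose
    # features align with the text (e.g. a "grasper retracts gallbladder" prompt).
    return np.array([cosine_similarity(p, text_embedding) for p in visual_patches])

# Toy example: 4 patch embeddings of dimension 8 and one text embedding.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))
# Construct the text embedding so that patch 2 is, by design, the best match.
text = patches[2] + 0.1 * rng.normal(size=8)
scores = similarity_map(patches, text)
print(int(np.argmax(scores)))  # index of the patch most aligned with the text
```

In the paper's visualizations, such per-region scores are rendered as a heatmap over the frame, so higher similarity at the actual interaction region indicates better vision-text alignment.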