🤖 AI Summary
Existing approaches to referring segmentation in surgical video rely on static visual features and predefined instrument names, which limits their generalization in complex scenarios involving occlusion, ambiguity, or nonstandard terminology. This work proposes SurgRef, a novel framework that treats the motion patterns of surgical instruments as the primary semantic carrier. Through a motion-guided multimodal alignment architecture, SurgRef aligns free-form textual descriptions with the temporal dynamics of instrument behavior, removing the dependence on static appearance and fixed nomenclature. Evaluated on Ref-IMotion, a newly curated multi-institutional dataset, the method achieves state-of-the-art referring segmentation accuracy and cross-scenario generalization, establishing a new benchmark for language-driven intelligent surgical systems.
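The summary does not spell out how the motion-guided alignment is implemented. As a rough intuition only, such an alignment could be realized by cross-attending per-frame motion features (e.g., frame-difference or flow encodings) against an embedding of the referring expression. The sketch below is a minimal, hypothetical illustration of that idea; the module name `MotionTextAlignment`, its dimensions, and the `mask_head` are assumptions, not SurgRef's actual architecture.

```python
import torch
import torch.nn as nn

class MotionTextAlignment(nn.Module):
    """Illustrative sketch: each frame's motion feature attends to the text query."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_head = nn.Linear(dim, 1)  # hypothetical per-frame relevance head

    def forward(self, text_emb, motion_feats):
        # text_emb:     (B, 1, D) pooled embedding of the referring expression
        # motion_feats: (B, T, D) temporal motion features, one vector per frame
        attended, _ = self.cross_attn(query=motion_feats, key=text_emb, value=text_emb)
        return self.mask_head(attended)  # (B, T, 1) per-frame relevance logits

# Toy usage with random tensors standing in for real text and motion encoders.
model = MotionTextAlignment()
logits = model(torch.randn(2, 1, 256), torch.randn(2, 16, 256))
print(logits.shape)  # torch.Size([2, 16, 1])
```

In a full referring-segmentation pipeline these per-frame relevance scores would typically modulate a pixel-level decoder to produce spatiotemporal masks, but that decoding stage is outside the scope of this sketch.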
📝 Abstract
Enabling intuitive, language-driven interaction with surgical scenes is a critical step toward intelligent operating rooms and autonomous surgical robotic assistance. However, referring segmentation, i.e., localizing surgical instruments based on natural language descriptions, remains underexplored in surgical videos, and existing approaches struggle to generalize because they rely on static visual cues and predefined instrument names. In this work, we introduce SurgRef, a novel motion-guided framework that grounds free-form language expressions in instrument motion, capturing how tools move and interact across time rather than what they look like. This allows models to understand and segment instruments even under occlusion, ambiguity, or unfamiliar terminology. To train and evaluate SurgRef, we present Ref-IMotion, a diverse, multi-institutional video dataset with dense spatiotemporal masks and rich motion-centric expressions. SurgRef achieves state-of-the-art accuracy and generalization across surgical procedures, setting a new benchmark for robust, language-driven surgical video segmentation.