🤖 AI Summary
This work addresses the challenges of low accuracy and poor interpretability in primary intention (PI) recognition from laparoscopic surgical videos. We propose a grammar-guided vision-semantics co-modeling framework that, for the first time, incorporates structured surgical activity grammar rules into surgical intention modeling. Our approach constructs a grammar-driven, interpretable parser that is jointly optimized with a multi-stage visual action detector. By integrating top-down semantic constraints with bottom-up visual features, it overcomes the semantic limitations inherent in purely data-driven methods. Evaluated on a standard benchmark dataset, the proposed method achieves significant improvements in PI recognition accuracy and robustness, providing a foundation for intraoperative planning in intelligent surgical robots that ensures both high precision and model interpretability.
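To make the top-down/bottom-up fusion concrete, here is a minimal sketch of grammar-constrained decoding: bottom-up per-step action probabilities from a detector are rescored under a top-down constraint on which actions may follow which. The action names, the transition table, and all probabilities are illustrative assumptions, not the paper's actual grammar or detector.

```python
# Hypothetical sketch: fusing bottom-up detector scores with top-down
# grammar constraints via Viterbi decoding. ACTIONS and ALLOWED are
# invented for illustration.
import numpy as np

ACTIONS = ["grasp", "retract", "dissect", "clip", "cut"]

# Top-down constraint: ALLOWED[i][j] = 1 if action j may follow action i.
ALLOWED = np.array([
    [1, 1, 1, 0, 0],   # grasp   -> grasp / retract / dissect
    [1, 1, 1, 0, 0],   # retract -> grasp / retract / dissect
    [0, 1, 1, 1, 0],   # dissect -> retract / dissect / clip
    [0, 0, 1, 1, 1],   # clip    -> dissect / clip / cut
    [1, 0, 0, 0, 1],   # cut     -> grasp / cut
], dtype=float)

def decode(detector_probs: np.ndarray) -> list[str]:
    """Most probable action sequence consistent with ALLOWED.

    detector_probs: (T, K) bottom-up per-step action posteriors.
    """
    T, K = detector_probs.shape
    log_emit = np.log(detector_probs + 1e-12)
    log_trans = np.log(ALLOWED + 1e-12)        # forbidden moves ~ -inf
    score = log_emit[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans       # (prev action, next action)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):               # backtrace best path
        path.append(int(back[t][path[-1]]))
    return [ACTIONS[k] for k in reversed(path)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(5), size=8)   # fake detector output
    print(decode(probs))
```

Here the grammar is reduced to a flat transition table; the paper's parser uses richer hierarchical rules, but the fusion principle is the same: visual scores are rescored under semantic constraints rather than trusted in isolation.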
📝 Abstract
Surgical procedures are inherently complex and dynamic, with intricate dependencies and varied execution paths. Accurately identifying the intentions behind critical actions, referred to as Primary Intentions (PIs), is crucial for understanding and planning the procedure. This paper presents a novel framework that advances PI recognition in instructional videos by combining top-down grammatical structure with bottom-up visual cues. The grammatical structure is derived from a rich corpus of surgical procedures, offering a hierarchical perspective on surgical activities. A grammar parser, built on this surgical activity grammar, processes the detections produced by surgical action detectors on laparoscopic images, ensuring a more precise interpretation of the visual information. Experimental results on the benchmark dataset demonstrate that our method outperforms existing surgical activity detectors that rely solely on visual features. Our research provides a promising foundation for developing advanced robotic surgical systems with enhanced planning and automation capabilities.
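The hierarchical, top-down role of the grammar parser can be illustrated with a toy probabilistic context-free grammar (PCFG). The sketch below uses NLTK's `ViterbiParser` to group detector outputs under higher-level intentions; the nonterminals, rules, probabilities, and action vocabulary are invented for illustration and are not the surgical activity grammar learned in the paper.

```python
# Hypothetical sketch: parsing a detected action sequence with a toy PCFG.
# All rules and probabilities below are illustrative assumptions.
import nltk

toy_grammar = nltk.PCFG.fromstring("""
    PI      -> EXPOSE DIVIDE        [1.0]
    EXPOSE  -> 'grasp' 'retract'    [0.6]
    EXPOSE  -> 'grasp' 'dissect'    [0.4]
    DIVIDE  -> 'clip' 'cut'         [1.0]
""")

parser = nltk.ViterbiParser(toy_grammar)

# Pretend these tokens are the top-1 labels emitted by an action detector.
detected = ["grasp", "retract", "clip", "cut"]

# The most probable parse groups low-level actions under higher-level
# intentions; an ungrammatical sequence yields no parse at all.
for tree in parser.parse(detected):
    tree.pretty_print()
    print("parse probability:", tree.prob())
```

A sequence the grammar cannot derive (e.g. `cut` before `grasp`) produces no parse, which is exactly the kind of semantic signal a purely visual detector lacks; jointly optimizing such a parser with the detector lets these violations feed back into the visual model.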