📝 Abstract
Robotic-assisted (RA) surgery promises to transform surgical intervention. Intuitive Surgical is committed to fostering these changes and the machine learning models and algorithms that will enable them. With these goals in mind, we have invited the surgical data science community to participate in a yearly competition hosted through the Medical Image Computing and Computer Assisted Intervention (MICCAI) conference. With challenges varying from year to year, we have asked the community to solve difficult machine learning problems in the context of advanced RA applications. Here we document the results of these challenges, focusing on surgical tool localization (SurgToolLoc). The publicly released dataset that accompanies these challenges is detailed in a separate paper, arXiv:2501.09209 [1].