🤖 AI Summary
To address the accuracy and efficiency bottlenecks in 6D pose estimation of surgical instruments, which stem from high calibration costs and the scarcity of public benchmarks, this paper introduces SurgPose, a large-scale surgical instrument pose estimation dataset featuring instance-level semantic keypoints and native stereo support. The authors propose a UV-fluorescent marker-based annotation method: keypoints are marked with paint that is invisible under white light and fluorescent under UV light, so the same trajectory can be recorded once for raw video and once for keypoint labels without altering the instrument's visible appearance. SurgPose comprises approximately 120K instrument instances across six categories, each annotated with seven semantic keypoints; 3D pose ground truth is obtained by lifting the 2D keypoints to 3D using depth from synchronized stereo video. The dataset contains 80K training and 40K validation samples, accompanied by benchmarking protocols using models such as HRNet. Baseline experiments on keypoint detection and pose estimation demonstrate the dataset's utility, positioning SurgPose as a foundational resource for augmented reality (AR) guided navigation and learning-based autonomous surgical manipulation.
📝 Abstract
Accurate and efficient surgical robotic tool pose estimation is of fundamental significance to downstream applications such as augmented reality (AR) in surgical training and learning-based autonomous manipulation. While significant advancements have been made in pose estimation for humans and animals, it remains a challenge in surgical robotics due to the scarcity of published data. The relatively large absolute error of the da Vinci end effector kinematics and the arduous calibration procedure make calibrated kinematics data collection expensive. Motivated by this limitation, we collected a dataset, dubbed SurgPose, providing instance-aware semantic keypoints and skeletons for visual surgical tool pose estimation and tracking. We mark keypoints using ultraviolet (UV) reactive paint, which is invisible under white light and fluorescent under UV light, and execute the same trajectory under both lighting conditions to collect raw videos and keypoint annotations, respectively. The SurgPose dataset consists of approximately 120k surgical instrument instances (80k for training and 40k for validation) across 6 categories. Each instrument instance is labeled with 7 semantic keypoints. Since the videos are collected in stereo pairs, the 2D pose can be lifted to 3D based on stereo-matching depth. In addition to releasing the dataset, we test a few baseline approaches to surgical instrument tracking to demonstrate the utility of SurgPose. More details can be found at surgpose.github.io.
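The stereo lifting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a rectified pinhole stereo pair, where depth follows from disparity as Z = f * B / d, and it is not the authors' pipeline. The names `StereoCalib`, `lift_keypoint`, and `lift_pose`, as well as every calibration value, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]


@dataclass(frozen=True)
class StereoCalib:
    """Hypothetical rectified-stereo parameters (not values from SurgPose)."""
    fx: float        # focal length along x, in pixels
    fy: float        # focal length along y, in pixels
    cx: float        # principal point x, in pixels
    cy: float        # principal point y, in pixels
    baseline: float  # distance between the two camera centres, in metres


def lift_keypoint(u: float, v: float, disparity: float,
                  calib: StereoCalib) -> Optional[Point3D]:
    """Back-project one left-image keypoint (u, v) to 3D camera coordinates.

    Returns None when the disparity is not positive, i.e. when stereo
    matching gave no usable depth at this pixel.
    """
    if disparity <= 0:
        return None
    z = calib.fx * calib.baseline / disparity  # depth from disparity
    x = (u - calib.cx) * z / calib.fx          # pinhole back-projection
    y = (v - calib.cy) * z / calib.fy
    return (x, y, z)


def lift_pose(keypoints: List[Tuple[float, float]],
              disparities: List[float],
              calib: StereoCalib) -> List[Optional[Point3D]]:
    """Lift a 2D pose (one (u, v) per keypoint) to 3D, keypoint by keypoint."""
    if len(keypoints) != len(disparities):
        raise ValueError("need exactly one disparity per keypoint")
    return [lift_keypoint(u, v, d, calib)
            for (u, v), d in zip(keypoints, disparities)]


if __name__ == "__main__":
    # Illustrative numbers only: 1000 px focal length, 5 mm baseline.
    calib = StereoCalib(fx=1000.0, fy=1000.0, cx=640.0, cy=360.0, baseline=0.005)
    pose_2d = [(640.0, 360.0), (740.0, 360.0), (700.0, 400.0)]
    disparities = [50.0, 50.0, 0.0]  # last keypoint has no valid match
    for point in lift_pose(pose_2d, disparities, calib):
        print(point)
```

In practice the per-keypoint disparity would be sampled from a dense disparity map produced by a stereo matcher at each annotated pixel. Keypoints without a valid match are returned as `None` here rather than being assigned an arbitrary depth.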