🤖 AI Summary
In minimally invasive surgical (MIS) robots, conventional camera-to-robot calibration breaks down because instruments have long kinematic chains and are only partially visible in the endoscopic view, which violates the assumption of a stiff robot with good visibility. Earlier keypoint-based and rendering-based methods, in turn, suffer from unstable feature detection or slow inference. To address this, we propose a unified geometric feature detection framework that jointly detects instrument keypoints and shaft edges through a shared encoder network and uses projection geometry constraints to recover the pose from a single forward pass. The model is trained end-to-end on large-scale synthetic data and requires no real-world annotations. Evaluated on real surgical scenes, our method improves on state-of-the-art keypoint-based and rendering-based baselines in both accuracy (32% reduction in mean reprojection error) and speed (inference under 15 ms). It is the first approach to satisfy both the robustness and the real-time requirements of online closed-loop control in surgical robotics.
📝 Abstract
Accurate camera-to-robot calibration is essential for any vision-based robotic control system and is especially critical in minimally invasive surgical (MIS) robots, where instruments perform precise micro-manipulations. However, MIS robots have long kinematic chains, and their degrees of freedom are only partially visible in the camera view, which poses challenges for conventional camera-to-robot calibration methods that assume stiff robots with good visibility. Previous works have investigated both keypoint-based and rendering-based approaches to address this challenge in real-world conditions; however, they often struggle with consistent feature detection or have long inference times, neither of which is acceptable for online robot control. In this work, we propose a novel framework that unifies the detection of geometric primitives (keypoints and shaft edges) through a shared encoding, enabling efficient pose estimation via projection geometry. The architecture detects both keypoints and edges in a single inference pass and is trained on large-scale synthetic data with projective labeling. The method is evaluated on both feature detection and pose estimation, with qualitative and quantitative results demonstrating fast performance and state-of-the-art accuracy in challenging surgical environments.
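
To make the idea of pose estimation via projection geometry more concrete, the following is a minimal sketch of how a candidate camera-to-robot pose could be scored against detected keypoints using the mean reprojection error. It is an illustration under stated assumptions, not the implementation described in this work: it assumes an ideal pinhole camera without lens distortion, it covers keypoints only (shaft edges are omitted), and the function names `project`, `transform`, and `mean_reprojection_error` are hypothetical.

```python
import math


def transform(R, t, p):
    """Apply a rigid transform to point p: R (3x3 nested lists) times p, plus t."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project(point_cam, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixels (assumed ideal pinhole model)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


def mean_reprojection_error(R, t, keypoints_robot, detections_px, intrinsics):
    """Mean pixel distance between projected robot keypoints and detected keypoints.

    R, t            : candidate camera-from-robot rotation and translation
    keypoints_robot : 3D keypoints expressed in the robot frame
    detections_px   : matching 2D detections (u, v) in pixels
    intrinsics      : (fx, fy, cx, cy), assumed known from camera calibration
    """
    fx, fy, cx, cy = intrinsics
    errors = []
    for p, (u, v) in zip(keypoints_robot, detections_px):
        u_hat, v_hat = project(transform(R, t, p), fx, fy, cx, cy)
        errors.append(math.hypot(u_hat - u, v_hat - v))
    return sum(errors) / len(errors)


# Hypothetical example values, chosen only for illustration.
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_example = (0.0, 0.0, 0.1)                     # robot origin 10 cm in front of the camera
keypoints = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0)]  # two keypoints 1 cm apart
detections = [(320.0, 240.0), (423.0, 244.0)]    # second detection is off by (3, 4) pixels
error = mean_reprojection_error(
    R_identity, t_example, keypoints, detections, (1000.0, 1000.0, 320.0, 240.0)
)
```

In this example the first keypoint projects exactly onto its detection and the second is off by 5 pixels, so the mean error is about 2.5 pixels. A pose estimator would search for the rotation and translation that minimize such a residual, typically together with an analogous term for the detected shaft edges.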