🤖 AI Summary
This work addresses the highly ill-posed problem of 3D pose estimation for surgical suturing needles under monocular endoscopic views, where depth ambiguity and rotational symmetry severely hinder accuracy. The authors propose a probabilistic variational inference framework that fuses monocular observations with robotic grasping constraints. The approach explicitly models and maintains multimodal pose uncertainty by pairing an analytically tractable geometric likelihood (with closed-form Jacobians) with Stein variational Newton updates that use Gauss-Newton preconditioning and kernel-based repulsion, thereby avoiding premature convergence to an incorrect mode. On real needle-tracking data, the method reduces mean translational and rotational errors to 1.00 mm (↓80%) and 13.80° (↓78%), respectively, relative to a particle-filter baseline. It also tracks robustly through occlusion in ex vivo suturing experiments, with average errors during occlusion of 1.34 mm and 19.18°.
📝 Abstract
Reliable estimation of surgical needle 3D position and orientation is essential for autonomous robotic suturing, yet existing methods operate almost exclusively under stereoscopic vision. In monocular endoscopic settings, common in transendoscopic and intraluminal procedures, depth ambiguity and rotational symmetry render needle pose estimation inherently ill-posed, producing a multimodal distribution over feasible configurations rather than a single, well-grounded estimate. We present PinPoint, a probabilistic variational inference framework that treats this ambiguity directly, maintaining a distribution of pose hypotheses rather than collapsing it to a single estimate. PinPoint combines monocular image observations with robot-grasp constraints through analytical geometric likelihoods with closed-form Jacobians. These Jacobians enable efficient Gauss-Newton preconditioning in Stein Variational Newton inference, where second-order particle transport deterministically moves particles toward high-probability regions while kernel-based repulsion preserves diversity in the multimodal structure. On real needle-tracking sequences, PinPoint reduces mean translational error by 80% (down to 1.00 mm) and rotational error by 78% (down to 13.80°) relative to a particle-filter baseline, with substantially better-calibrated uncertainty. On induced-rotation sequences, where monocular ambiguity is most severe, PinPoint maintains a bimodal posterior 84% of the time, almost three times the rate of the particle-filter baseline, correctly preserving the alternative hypothesis rather than committing prematurely to one mode. Suturing experiments in ex vivo tissue demonstrate stable tracking through intermittent occlusion, with average errors during occlusion of 1.34 mm in translation and 19.18° in rotation, even when the needle is fully embedded.
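To make the interplay between particle transport and kernel-based repulsion concrete, the following toy sketch may help. It is not PinPoint: it uses plain first-order Stein variational gradient descent (SVGD) rather than Stein Variational Newton with Gauss-Newton preconditioning, and it replaces the needle-pose posterior with a hypothetical one-dimensional, two-mode Gaussian mixture. All function names, the mixture parameters, the step size, and the kernel bandwidth are illustrative assumptions.

```python
import math


def grad_log_p(x, mu=2.0, sigma=0.5):
    """Score of an equal-weight mixture of N(-mu, sigma^2) and N(+mu, sigma^2).

    Assumption: this toy bimodal density stands in for an ambiguous pose posterior.
    For this symmetric mixture, d/dx log p(x) = (-x + mu * tanh(mu * x / sigma^2)) / sigma^2.
    """
    s2 = sigma * sigma
    return (-x + mu * math.tanh(mu * x / s2)) / s2


def svgd_step(particles, step=0.05, bandwidth=0.5):
    """One first-order SVGD update with an RBF kernel (no preconditioning)."""
    n = len(particles)
    h2 = bandwidth * bandwidth
    scores = [grad_log_p(x) for x in particles]
    updated = []
    for xi in particles:
        phi = 0.0
        for xj, sj in zip(particles, scores):
            k = math.exp(-((xj - xi) ** 2) / (2.0 * h2))
            phi += k * sj              # attraction: kernel-weighted score
            phi += k * (xi - xj) / h2  # repulsion: gradient of the kernel w.r.t. xj
        updated.append(xi + step * phi / n)
    return updated


def run_svgd(n_particles=20, n_iters=500):
    """Start from an evenly spaced grid on [-1, 1] and iterate SVGD."""
    particles = [-1.0 + 2.0 * i / (n_particles - 1) for i in range(n_particles)]
    for _ in range(n_iters):
        particles = svgd_step(particles)
    return particles


if __name__ == "__main__":
    final = run_svgd()
    left = [x for x in final if x < 0]
    right = [x for x in final if x > 0]
    print("particles near each mode:", len(left), len(right))
    print("cluster means:", sum(left) / len(left), sum(right) / len(right))
```

In this sketch the attraction term pulls each particle toward nearby high-density regions, while the repulsion term keeps particles from stacking on top of one another, so the particle set can cover both modes instead of collapsing onto one. A second-order variant would additionally rescale the update direction with curvature information, which is the role the abstract assigns to Gauss-Newton preconditioning.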