🤖 AI Summary
This work addresses the challenge of accurate surgical instrument tracking in robotic surgery, where occlusions, complex joint structures, and degraded image quality hinder reliable pose estimation. The authors propose a novel tracking framework that integrates the covariance matrix adaptation evolution strategy (CMA-ES) with batched GPU-based rendering. For the first time, CMA-ES is employed in rendering-based pose estimation to jointly optimize both the 6D pose of surgical instruments and their underlying joint configurations, eliminating the need for prior joint angle information and enabling support for bimanual instrument scenarios. By evaluating multiple pose hypotheses in parallel, the method achieves a favorable balance between real-time performance and robustness. Extensive experiments on both synthetic and real datasets demonstrate that the proposed approach significantly outperforms existing methods in accuracy and speed, while exhibiting superior convergence stability and generalization capability.
📝 Abstract
Accurate and efficient tracking of surgical instruments is fundamental for Robot-Assisted Minimally Invasive Surgery. Although vision-based robot pose estimation has enabled markerless calibration without tedious physical setups, reliable tool tracking for surgical robots remains challenging due to the partial visibility and specialized articulation design of surgical instruments. Previous methods in this field are often prone to unreliable feature detection under degraded visual quality and data scarcity, whereas rendering-based methods often struggle with computational cost and suboptimal convergence. In this work, we incorporate CMA-ES, an evolutionary optimization strategy, into a versatile tracking pipeline that jointly estimates surgical instrument pose and joint configurations. By using batched rendering to efficiently evaluate multiple pose candidates in parallel, the method significantly reduces inference time and improves convergence robustness. The proposed framework further generalizes to joint-angle-free and bimanual tracking settings, making it suitable for both vision feedback control and online surgery video calibration. Extensive experiments on synthetic and real-world datasets demonstrate that the proposed method significantly outperforms prior approaches in both accuracy and runtime.
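The core loop the abstract describes is an ask–evaluate–tell pattern: sample a batch of pose/joint hypotheses from a search distribution, score them all at once (in the paper, via a batched GPU renderer), and adapt the distribution toward the best candidates. The paper's actual implementation is not shown here; below is a simplified, self-contained sketch of that pattern with a rank-μ covariance update only (no step-size or evolution-path adaptation, unlike full CMA-ES), where the hypothetical `batched_fitness` function stands in for the batched rendering-and-comparison step.

```python
import numpy as np

def cmaes_minimize(batched_fitness, x0, sigma=0.3, n_iters=200, popsize=16, seed=0):
    """Simplified CMA-ES-style minimizer (rank-mu covariance update only).

    batched_fitness(X) takes a (popsize, dim) array of candidate parameter
    vectors (e.g. 6D pose + joint angles) and returns a (popsize,) array of
    costs -- mirroring how a batched renderer scores many hypotheses at once.
    """
    rng = np.random.default_rng(seed)          # fixed seed for reproducibility
    mean = np.asarray(x0, dtype=float)
    dim = mean.size
    C = np.eye(dim)                            # covariance of the search distribution
    mu = popsize // 2                          # number of selected parents
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                               # log-rank recombination weights

    for _ in range(n_iters):
        # "ask": sample a batch of candidates around the current mean
        X = rng.multivariate_normal(mean, sigma**2 * C, size=popsize)
        # "evaluate": score the whole batch in one call (parallel-friendly)
        costs = batched_fitness(X)
        elite = X[np.argsort(costs)[:mu]]
        # "tell": recombine the elite into a new mean, adapt the covariance
        y = (elite - mean) / sigma
        mean = w @ elite
        C = 0.8 * C + 0.2 * (y.T * w) @ y      # rank-mu update
    return mean

# Toy usage: recover a hidden 3-parameter "pose" from a quadratic cost.
target = np.array([1.0, -2.0, 0.5])
est = cmaes_minimize(lambda X: np.sum((X - target) ** 2, axis=1), np.zeros(3))
```

Because each iteration evaluates the whole population in a single `batched_fitness` call, the expensive scoring step maps naturally onto batched GPU rendering, which is the parallelism the abstract credits for the runtime reduction.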