🤖 AI Summary
To address inaccurate pose estimation caused by poor end-effector proprioception in minimally invasive robotic surgery, this paper proposes a real-time, vision-based pose correction method. Methodologically, it introduces an end-to-end differentiable robotic kinematic model tightly coupled with neural rendering (built upon Kaolin/DiffRend), forming a vision-transformer-driven joint optimization framework that enables noise-robust self-supervised training in simulation, with sim-to-real transfer as the stated longer-term goal. Experiments in simulation demonstrate single-frame inference latency under 10 ms and a 62% reduction in pose estimation error compared with joint-encoder-based approaches, significantly improving both the accuracy and the generalization of visual pose estimation. The core contribution is a differentiable kinematics and neural rendering co-design that offers a path toward high-accuracy, low-latency, and robust visual pose perception for surgical robots.
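One way to make the co-design concrete is as a single image-space objective in which the correction network, the kinematic chain, and the renderer are composed and optimized together. The notation below is our own illustrative formalization of the description above, not taken from the paper:

$$
\hat{\theta} = \theta_{\mathrm{enc}} + f_{\phi}(I, \theta_{\mathrm{enc}}), \qquad
\mathcal{L}(\phi) = \big\lVert \mathcal{R}\big(\mathrm{FK}(\hat{\theta})\big) - I \big\rVert_2^2
$$

Here $\theta_{\mathrm{enc}}$ are the noisy joint-encoder readings, $f_{\phi}$ is the vision transformer predicting a pose residual from the camera image $I$, $\mathrm{FK}$ is the differentiable kinematic chain, and $\mathcal{R}$ is the differentiable renderer. Because $\mathrm{FK}$ and $\mathcal{R}$ are both differentiable, the photometric loss back-propagates end to end into $\phi$, and no ground-truth pose labels are required, which is what makes the training self-supervised.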
📝 Abstract
Autonomy in Minimally Invasive Robotic Surgery (MIRS) has the potential to reduce surgeon cognitive and task load, thereby increasing procedural efficiency. However, implementing accurate autonomous control can be difficult due to poor end-effector proprioception, a limitation of the cable-driven mechanisms of these robots. Although the robot may have joint encoders from which the end-effector pose can be computed, various non-idealities make the overall kinematic chain inaccurate. Modern vision-based pose estimation methods lack real-time capability or can be difficult to train and to generalize. In this work, we demonstrate a real-time-capable, vision-transformer-based pose estimation approach that is trained using end-to-end differentiable kinematics and rendering in simulation. We demonstrate the potential of this method to correct for noisy pose estimates in simulation, with the longer-term goal of verifying the sim-to-real transferability of our approach.
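To illustrate the training scheme end to end, here is a minimal, self-contained Python/PyTorch sketch. Everything in it is an illustrative assumption rather than the authors' implementation: a 2-link planar arm stands in for the instrument's kinematic chain, a soft Gaussian splatting function stands in for the Kaolin-based renderer, and a small MLP stands in for the vision transformer. Names such as `PoseCorrector` and `soft_render` are hypothetical.

```python
# Hypothetical sketch of self-supervised training with differentiable
# kinematics + rendering. A 2-link planar arm replaces the instrument's
# kinematic chain and Gaussian splatting replaces the neural renderer;
# none of these names come from the paper.
import torch
import torch.nn as nn

def forward_kinematics(thetas, link_lengths):
    """Differentiable FK for a planar chain: joint positions, shape (B, J+1, 2)."""
    pts = [torch.zeros(thetas.shape[0], 2)]
    angle = torch.zeros(thetas.shape[0])
    for j, length in enumerate(link_lengths):
        angle = angle + thetas[:, j]                       # cumulative link angle
        step = length * torch.stack([torch.cos(angle), torch.sin(angle)], dim=-1)
        pts.append(pts[-1] + step)
    return torch.stack(pts, dim=1)

def soft_render(points, res=32, sigma=0.05):
    """Differentiable 'renderer': splat joint positions into a soft 2D image."""
    xs = torch.linspace(-1.0, 1.0, res)
    gy, gx = torch.meshgrid(xs, xs, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1).reshape(1, 1, res * res, 2)
    d2 = ((grid - points.unsqueeze(2)) ** 2).sum(-1)       # (B, J+1, res*res)
    img = torch.exp(-d2 / (2.0 * sigma ** 2)).sum(1)       # accumulate splats
    return img.reshape(-1, res, res).clamp(max=1.0)

class PoseCorrector(nn.Module):
    """MLP stand-in for the vision transformer: predicts joint-angle residuals."""
    def __init__(self, res=32, n_joints=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(res * res + n_joints, 128), nn.ReLU(),
            nn.Linear(128, n_joints),
        )

    def forward(self, image, noisy_thetas):
        return self.net(torch.cat([image.flatten(1), noisy_thetas], dim=1))

torch.manual_seed(0)
links = [0.5, 0.4]                                         # assumed link lengths
model = PoseCorrector()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    true_thetas = (torch.rand(16, 2) - 0.5) * 2.0          # random simulated poses
    with torch.no_grad():                                  # "camera" observation
        observed = soft_render(forward_kinematics(true_thetas, links))
    # Corrupt the pose, emulating cable-drive / encoder non-idealities.
    noisy = true_thetas + 0.2 * torch.randn_like(true_thetas)
    corrected = noisy + model(observed, noisy)             # predicted residual
    rendered = soft_render(forward_kinematics(corrected, links))
    loss = ((rendered - observed) ** 2).mean()             # image-space loss only
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step:3d}  photometric loss {loss.item():.4f}")
```

In the paper's actual pipeline the kinematic model and renderer would be the differentiable, Kaolin-based ones named in the summary, and the observation would come from the endoscopic camera; what this sketch is meant to convey is the structure of the loop (predict a residual, re-render through differentiable kinematics, compare in image space), which never touches ground-truth pose labels.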