🤖 AI Summary
This work addresses the challenge of constructing high-fidelity, controllable digital twins of surgical instruments from pose-free endoscopic videos to support Real2Sim applications in robot-assisted surgery. The authors propose a novel framework based on monocular 3D Gaussian splatting that, for the first time, integrates CAD priors into the Gaussian representation to enable part-aware rendering. They introduce a Semantics-Aware Pose Estimation and Tracking (SAPET) method that leverages purely synthetic semantic supervision to accurately recover 6-DoF poses and joint angles from unposed videos. Additionally, a Robust Texture Learning (RTL) strategy is developed to jointly optimize pose and appearance. Evaluated on EndoVis17/18, SAR-RARP, and an in-house dataset, the method outperforms existing approaches in photometric quality, geometric accuracy, and downstream keypoint detection tasks.
📝 Abstract
High-quality and controllable digital twins of surgical instruments are critical for Real2Sim in robot-assisted surgery, as they enable realistic simulation, synthetic data generation, and perception learning under novel poses. We present Instrument-Splatting++, a monocular 3D Gaussian Splatting (3DGS) framework that reconstructs surgical instruments as fully controllable, high-fidelity Gaussian assets. Our pipeline starts with part-wise geometry pretraining that injects CAD priors into Gaussian primitives and equips the representation with part-aware semantic rendering. Building on the pretrained model, we propose a semantics-aware pose estimation and tracking (SAPET) method to recover per-frame 6-DoF pose and joint angles from unposed endoscopic videos, where a gripper-tip network trained purely on synthetic semantics provides robust supervision and a loose regularization suppresses singular articulations. Finally, we introduce Robust Texture Learning (RTL), which alternates pose refinement with robust appearance optimization, mitigating pose noise during texture learning. The proposed framework thus performs pose estimation and learns realistic textures from unposed videos. We validate our method on sequences extracted from EndoVis17/18, SAR-RARP, and an in-house dataset, showing superior photometric quality and improved geometric accuracy over state-of-the-art baselines. We further demonstrate a downstream keypoint detection task in which unseen-pose data augmentation from our controllable instrument Gaussian asset improves performance.
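To make the RTL idea concrete, here is a toy 1-D sketch of the alternating scheme the abstract describes: a pose update with appearance frozen, followed by a robust (Huber-gradient) appearance update with pose frozen, so outlier pixels pull less on the texture. The `render` function, parameters, and learning rates below are all hypothetical stand-ins (a Gaussian bump instead of a splatting renderer), not the paper's implementation.

```python
import numpy as np

# Hypothetical stand-in "renderer": intensity = texture amplitude times a bump
# centred at the pose. NOT the paper's splatting renderer, just a 1-D analogue
# for illustrating the alternating pose/appearance optimization.
def render(tex, pose, xs):
    return tex * np.exp(-(xs - pose) ** 2)

def huber_grad(r, delta=0.1):
    # Derivative of the Huber loss w.r.t. the residual r:
    # linear near zero, clipped for large residuals (robust to outlier pixels).
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

xs = np.linspace(-2.0, 2.0, 64)
pose_true, tex_true = 0.3, 1.5
obs = render(tex_true, pose_true, xs)
obs[::16] += 1.0  # a few outlier pixels, e.g. specular highlights

pose, tex = 0.0, 1.0
mse0 = float(np.mean((render(tex, pose, xs) - obs) ** 2))

for _ in range(500):
    # (a) pose refinement with appearance frozen (plain gradient step on L2 loss)
    r = render(tex, pose, xs) - obs
    pose -= 0.1 * np.mean(r * tex * np.exp(-(xs - pose) ** 2) * 2 * (xs - pose))
    # (b) robust appearance update with pose frozen (Huber gradient)
    r = render(tex, pose, xs) - obs
    tex -= 0.5 * np.mean(huber_grad(r) * np.exp(-(xs - pose) ** 2))

mse1 = float(np.mean((render(tex, pose, xs) - obs) ** 2))
print(f"pose: {pose:.2f}  tex: {tex:.2f}  mse: {mse0:.3f} -> {mse1:.3f}")
```

The clipped Huber gradient in step (b) is one simple way to keep corrupted pixels from dominating the appearance update while the pose is still noisy; the actual robust loss and scheduling used by RTL may differ.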