Instrument-Splatting++: Towards Controllable Surgical Instrument Digital Twin Using Gaussian Splatting

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of constructing high-fidelity, controllable digital twins of surgical instruments from pose-free endoscopic videos to support Real2Sim applications in robot-assisted surgery. The authors propose a novel framework based on monocular 3D Gaussian splatting, which, for the first time, integrates CAD priors into the Gaussian representation to enable part-aware rendering. They introduce a Semantic-Aware Pose Estimation and Tracking (SAPET) method that leverages purely synthetic semantic supervision to accurately recover 6-DoF poses and joint angles from unposed videos. Additionally, a Robust Texture Learning (RTL) strategy is developed to jointly optimize pose and appearance. Evaluated on EndoVis17/18, SAR-RARP, and an internal dataset, the method outperforms existing approaches in photometric quality, geometric accuracy, and downstream keypoint detection tasks.

📝 Abstract
High-quality and controllable digital twins of surgical instruments are critical for Real2Sim in robot-assisted surgery, as they enable realistic simulation, synthetic data generation, and perception learning under novel poses. We present Instrument-Splatting++, a monocular 3D Gaussian Splatting (3DGS) framework that reconstructs surgical instruments as a fully controllable Gaussian asset with high fidelity. Our pipeline starts with part-wise geometry pretraining that injects CAD priors into Gaussian primitives and equips the representation with part-aware semantic rendering. Built on the pretrained model, we propose a semantics-aware pose estimation and tracking (SAPET) method to recover per-frame 6-DoF pose and joint angles from unposed endoscopic videos, where a gripper-tip network trained purely from synthetic semantics provides robust supervision and a loose regularization suppresses singular articulations. Finally, we introduce Robust Texture Learning (RTL), which alternates pose refinement and robust appearance optimization, mitigating pose noise during texture learning. The proposed framework can perform pose estimation and learn realistic texture from unposed videos. We validate our method on sequences extracted from EndoVis17/18, SAR-RARP, and an in-house dataset, showing superior photometric quality and improved geometric accuracy over state-of-the-art baselines. We further demonstrate a downstream keypoint detection task where unseen-pose data augmentation from our controllable instrument Gaussian improves performance.
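The abstract's Robust Texture Learning (RTL) alternates pose refinement with robust appearance optimization so that pose noise does not corrupt texture learning. A minimal toy sketch of such an alternating scheme follows; this is not the paper's renderer, loss, or Gaussian representation — the 1-D "render" model, the parameters `pose`/`gain`, and the learning rate are all illustrative stand-ins:

```python
import numpy as np

# Toy alternating optimization in the spirit of pose/appearance alternation:
# refine a pose parameter and an appearance (texture) parameter in turn
# against a photometric loss. The 1-D Gaussian bump stands in for a render.
x = np.linspace(-1.0, 1.0, 200)
true_pose, true_gain = 0.3, 1.5
target = true_gain * np.exp(-((x - true_pose) ** 2) / 0.1)  # "observed" frame

def render(pose, gain):
    # Hypothetical differentiable render: a gain-scaled bump at `pose`.
    return gain * np.exp(-((x - pose) ** 2) / 0.1)

def loss(pose, gain):
    # Photometric (MSE) discrepancy between render and observation.
    return np.mean((render(pose, gain) - target) ** 2)

pose, gain = 0.0, 1.0   # initial pose and flat texture
lr, eps = 0.1, 1e-4
for _ in range(1000):
    # Pose refinement step (appearance frozen), finite-difference gradient.
    g_pose = (loss(pose + eps, gain) - loss(pose - eps, gain)) / (2 * eps)
    pose -= lr * g_pose
    # Appearance step (pose frozen): texture update under the refined pose.
    g_gain = (loss(pose, gain + eps) - loss(pose, gain - eps)) / (2 * eps)
    gain -= lr * g_gain

print(f"pose={pose:.3f}, gain={gain:.3f}")  # both should approach 0.3, 1.5
```

The key design point the sketch mirrors is that each sub-problem is optimized with the other held fixed, so appearance gradients are only taken once the current pose estimate has been refined.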
Problem

Research questions and friction points this paper is trying to address.

surgical instrument digital twin
controllable 3D reconstruction
monocular endoscopic video
pose estimation
realistic texture learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting
surgical instrument digital twin
semantic-aware pose estimation
robust texture learning
CAD prior integration
Shuojue Yang
Department of Biomedical Engineering, National University of Singapore (NUS), Singapore
Zijian Wu
University of British Columbia
Surgical Robotics, Image Guided Surgery, Robot-assisted Surgery
Chengjiaao Liao
Department of Biomedical Engineering, National University of Singapore (NUS), Singapore
Qian Li
Department of Biomedical Engineering, National University of Singapore (NUS), Singapore
Daiyun Shen
PhD at National University of Singapore
medical AI
Chang Han Low
National University of Singapore
Surgical AI, Medical Imaging, Multi-Agent System, Multi-Modal
Septimiu E. Salcudean
Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, Canada
Yueming Jin
Assistant Professor, National University of Singapore
Medical Image Analysis, Surgical AI & Robotics, Multimodal Learning