🤖 AI Summary
This study addresses the limitations of existing methods for 3D reconstruction in surgical scenes, which often perform poorly in instrument-occluded regions and lack rigorous evaluation of depth accuracy. The authors propose a two-stage framework: first, a diffusion-based video model incorporating temporal priors is introduced to inpaint occluded tissue with high spatiotemporal consistency; second, a learnable deformation-aware 2D Gaussian Splatting (2DGS) module reconstructs dynamic tissue geometry and appearance. This work pioneers the use of temporally aware diffusion models for surgical occlusion inpainting and advances evaluation by introducing quantitative depth accuracy metrics beyond conventional image-based benchmarks. The method achieves PSNR scores of 38.02 dB and 34.40 dB on EndoNeRF and StereoMIS, respectively, and demonstrates superior depth reconstruction fidelity on the SCARED dataset.
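To make the two-stage design concrete, below is a minimal, self-contained sketch of the data flow, assuming simple stand-ins: a per-pixel temporal median replaces the diffusion-based video inpainter, and random placeholder parameters replace the deformation-aware 2DGS optimizer. All function names here are hypothetical, not the paper's API.

```python
import numpy as np

def inpaint_occluded_tissue(frames: np.ndarray, tool_masks: np.ndarray) -> np.ndarray:
    """Stage-1 stand-in: fill instrument-occluded pixels.

    A per-pixel temporal median over unoccluded observations substitutes for
    the diffusion-based video inpainter, purely to illustrate the interface.
    """
    visible = ~tool_masks.astype(bool)                    # (T, H, W) tissue visible
    stack = np.where(visible[..., None], frames, np.nan)  # hide occluded pixels
    fill = np.nanmedian(stack, axis=0)                    # (H, W, C) static tissue estimate
    out = frames.copy()
    for t in range(frames.shape[0]):
        occluded = tool_masks[t].astype(bool)
        out[t][occluded] = fill[occluded]
    return out

def fit_dynamic_2dgs(frames: np.ndarray) -> dict:
    """Stage-2 stand-in: return placeholder Gaussian and deformation parameters
    instead of actually optimizing 2DGS with a learnable deformation model."""
    rng = np.random.default_rng(0)
    n = 1024
    return {
        "means": rng.normal(size=(n, 3)),    # Gaussian centers
        "deform": rng.normal(size=(n, 8)),   # per-Gaussian deformation coefficients
    }

# Usage: 8 frames of 64x64 RGB video with roughly 10% of pixels tool-occluded.
frames = np.random.rand(8, 64, 64, 3).astype(np.float32)
tool_masks = np.random.rand(8, 64, 64) > 0.9
clean_frames = inpaint_occluded_tissue(frames, tool_masks)
scene = fit_dynamic_2dgs(clean_frames)
```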
📝 Abstract
Real-time reconstruction of deformable surgical scenes is vital for advancing robotic surgery, improving surgeon guidance, and enabling automation. Recent methods achieve dense reconstructions from da Vinci robotic surgery videos, with Gaussian Splatting (GS) offering real-time performance via graphics acceleration. However, reconstruction quality in occluded regions remains limited, and depth accuracy has not been fully assessed, as benchmarks like EndoNeRF and StereoMIS lack 3D ground truth. We propose Diff2DGS, a novel two-stage framework for reliable 3D reconstruction of occluded surgical scenes. In the first stage, a diffusion-based video module with temporal priors inpaints tissue occluded by instruments with high spatiotemporal consistency. In the second stage, we adapt 2D Gaussian Splatting (2DGS) with a Learnable Deformation Model (LDM) to capture dynamic tissue deformation and anatomical geometry. We also extend evaluation beyond prior image-quality metrics by performing quantitative depth accuracy analysis on the SCARED dataset. Diff2DGS outperforms state-of-the-art approaches in both appearance and geometry, reaching 38.02 dB PSNR on EndoNeRF and 34.40 dB on StereoMIS. Furthermore, our experiments demonstrate that optimizing for image quality alone does not necessarily translate into optimal 3D reconstruction accuracy. To address this, we additionally optimize the depth quality of the reconstructed 3D results, ensuring faithful geometry alongside high-fidelity appearance.
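Since the abstract does not state which depth metrics are reported, here is a minimal sketch of the depth-accuracy metrics commonly used on SCARED-style benchmarks (absolute relative error, RMSE, and the δ < 1.25 inlier ratio); treat this metric choice as an assumption, not the paper's actual protocol.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray) -> dict:
    """Standard depth-accuracy metrics over pixels with ground-truth depth.

    pred, gt: (H, W) depth maps in the same units (e.g., millimeters).
    valid:    (H, W) boolean mask of pixels with usable ground truth.
    """
    p, g = pred[valid], gt[valid]
    abs_rel = float(np.mean(np.abs(p - g) / g))              # absolute relative error
    rmse = float(np.sqrt(np.mean((p - g) ** 2)))             # root-mean-square error
    delta = float(np.mean(np.maximum(p / g, g / p) < 1.25))  # inlier ratio (delta < 1.25)
    return {"abs_rel": abs_rel, "rmse": rmse, "delta<1.25": delta}

# Usage with synthetic depths in millimeters.
gt = np.random.uniform(20.0, 120.0, size=(64, 64))
pred = gt * np.random.uniform(0.95, 1.05, size=gt.shape)     # ~5% multiplicative noise
valid = gt > 0
print(depth_metrics(pred, gt, valid))
```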