SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation

📅 2026-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of 3D surgical scene reconstruction from monocular endoscopic videos, which is hindered by the absence of real-world supervisory signals and by performance degradation over long sequences. To overcome these limitations, the authors propose a systematic framework that first leverages publicly available stereo surgical data to build a pipeline that generates large-scale, metrically accurate pseudo-depth maps. They then introduce a hybrid supervision strategy incorporating geometric self-correction, followed by a dual-model hierarchical inference mechanism that simultaneously preserves global temporal consistency and enhances local geometric fidelity. Evaluated on the SCARED and StereoMIS datasets, the method achieves near state-of-the-art reconstruction accuracy while significantly accelerating pose estimation, offering an efficient and robust solution for monocular 3D reconstruction in surgical navigation.
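The pseudo-depth generation step described above rests on standard stereo geometry: given a rectified stereo pair with known focal length and baseline, depth follows from disparity as depth = focal × baseline / disparity. The sketch below illustrates that conversion; the function name, the toy values, and the minimum-disparity cutoff are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def disparity_to_metric_depth(disparity, focal_px, baseline_m, min_disp=0.5):
    """Convert a rectified stereo disparity map to a metric pseudo-depth map.

    depth = focal_px * baseline_m / disparity. Pixels with tiny or zero
    disparity are masked out (depth set to 0) instead of producing
    unbounded depth values.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    valid = disparity > min_disp
    depth = np.zeros_like(disparity)
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth, valid

# Toy 2x2 disparity map with f = 500 px and a 4 mm baseline.
disp = np.array([[8.0, 0.0],
                 [4.0, 2.0]])
depth, valid = disparity_to_metric_depth(disp, focal_px=500.0, baseline_m=0.004)
```

A real pipeline would additionally rectify the pair, run a stereo matcher to obtain the disparity, and filter occlusions via a left-right consistency check; only the final disparity-to-depth step is shown here.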

📝 Abstract
Reconstructing surgical scenes from monocular endoscopic video is critical for advancing robotic-assisted surgery. However, the application of state-of-the-art general-purpose reconstruction models is constrained by two key challenges: the lack of supervised training data and performance degradation over long video sequences. To overcome these limitations, we propose SurgCUT3R, a systematic framework that adapts unified 3D reconstruction models to the surgical domain. Our contributions are threefold. First, we develop a data generation pipeline that exploits public stereo surgical datasets to produce large-scale, metric-scale pseudo-ground-truth depth maps, effectively bridging the data gap. Second, we propose a hybrid supervision strategy that couples our pseudo-ground-truth with geometric self-correction to enhance robustness against inherent data imperfections. Third, we introduce a hierarchical inference framework that employs two specialized models to effectively mitigate accumulated pose drift over long surgical videos: one for global stability and one for local accuracy. Experiments on the SCARED and StereoMIS datasets demonstrate that our method achieves a competitive balance between accuracy and efficiency, delivering near state-of-the-art but substantially faster pose estimation and offering a practical and effective solution for robust reconstruction in surgical environments. Project page: https://chumo-xu.github.io/SurgCUT3R-ICRA26/.
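The hybrid supervision strategy in the abstract pairs pseudo-ground-truth depth with geometric self-correction. A minimal sketch of such a loss is shown below, assuming the self-correction term penalises disagreement between the predicted depth and the prediction warped from a neighbouring view; the function name, the weighting, and the `reproj` input are hypothetical, standing in for whatever geometric consistency signal the paper actually uses.

```python
import numpy as np

def hybrid_depth_loss(pred, pseudo_gt, reproj, valid_mask, w_self=0.5):
    """Masked L1 loss to pseudo-ground-truth plus a geometric
    self-consistency term.

    `reproj` is the depth prediction reprojected from a neighbouring
    frame; agreement between `pred` and `reproj` discourages geometry
    that only fits noisy pseudo-labels.
    """
    m = np.asarray(valid_mask, dtype=bool)
    supervised = np.abs(pred[m] - pseudo_gt[m]).mean()
    self_corr = np.abs(pred[m] - reproj[m]).mean()
    return supervised + w_self * self_corr

# Toy example on a 1x2 depth map with all pixels valid.
pred = np.array([[1.0, 2.0]])
pseudo_gt = np.array([[1.5, 2.0]])
reproj = np.array([[1.0, 1.0]])
loss = hybrid_depth_loss(pred, pseudo_gt, reproj, np.ones_like(pred, dtype=bool))
```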
Problem

Research questions and friction points this paper is trying to address.

surgical scene reconstruction
monocular endoscopic video
3D reconstruction
long video sequences
supervised training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

surgical scene reconstruction
pseudo-ground-truth depth
hybrid supervision
hierarchical inference
pose drift mitigation
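The hierarchical inference idea listed above, one model for global stability and one for local accuracy, can be caricatured as anchoring short per-chunk trajectories from a local model to sparse keyframe poses from a global model, so drift cannot accumulate across chunks. The helper names and the 4×4-matrix pose convention below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def stitch_chunks(keyframe_poses, chunk_relative_poses):
    """Compose per-chunk relative poses (local model) onto keyframe
    world poses (global model).

    keyframe_poses: one 4x4 world pose per chunk start.
    chunk_relative_poses: per chunk, a list of 4x4 poses relative to
    that chunk's first frame (identity for the first frame).
    """
    trajectory = []
    for anchor, rels in zip(keyframe_poses, chunk_relative_poses):
        for rel in rels:
            trajectory.append(anchor @ rel)
    return trajectory

def T(x):
    """Hypothetical helper: pure x-translation as a 4x4 pose matrix."""
    M = np.eye(4)
    M[0, 3] = x
    return M

# Two chunks of two frames each; the second chunk is anchored 1 m along x.
traj = stitch_chunks(
    [T(0.0), T(1.0)],
    [[T(0.0), T(0.5)], [T(0.0), T(0.2)]],
)
```

Because each chunk is re-anchored to a globally estimated keyframe pose, error from the local model stays bounded within a chunk instead of compounding over the whole sequence.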
Kaiyuan Xu
The Hamlyn Centre for Robotic Surgery, Imperial College London, SW7 2AZ, UK.
Fangzhou Hong
Nanyang Technological University
3D Computer Vision
Daniel Elson
The Hamlyn Centre for Robotic Surgery, Imperial College London, SW7 2AZ, UK.
Baoru Huang
University of Liverpool; Imperial College London
Robotics · Computer Vision · Surgical Vision · Image-Guided Intervention