SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation

📅 2026-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of 3D surgical scene reconstruction from monocular endoscopic videos, which is hindered by the absence of real-world supervisory signals and by performance degradation over long sequences. To overcome these limitations, the authors propose a systematic framework that first leverages publicly available stereo surgical data to build a pipeline that generates large-scale, metrically accurate pseudo-depth maps. They then introduce a hybrid supervision strategy incorporating geometric self-correction, followed by a dual-model hierarchical inference mechanism that simultaneously preserves global temporal consistency and enhances local geometric fidelity. Evaluated on the SCARED and StereoMIS datasets, the method achieves near state-of-the-art reconstruction accuracy while significantly accelerating pose estimation, offering an efficient and robust solution for monocular 3D reconstruction in surgical navigation.
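The pseudo-depth generation step described above rests on standard stereo geometry: given a rectified stereo pair with known focal length and baseline, depth follows from disparity as depth = focal × baseline / disparity. The sketch below illustrates that conversion; the function name, the toy values, and the minimum-disparity cutoff are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def disparity_to_metric_depth(disparity, focal_px, baseline_m, min_disp=0.5):
    """Convert a rectified stereo disparity map to a metric pseudo-depth map.

    depth = focal_px * baseline_m / disparity. Pixels with tiny or zero
    disparity are masked out (depth set to 0) instead of producing
    unbounded depth values.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    valid = disparity > min_disp
    depth = np.zeros_like(disparity)
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth, valid

# Toy 2x2 disparity map with f = 500 px and a 4 mm baseline.
disp = np.array([[8.0, 0.0],
                 [4.0, 2.0]])
depth, valid = disparity_to_metric_depth(disp, focal_px=500.0, baseline_m=0.004)
```

A real pipeline would additionally rectify the pair, run a stereo matcher to obtain the disparity, and filter occlusions via a left-right consistency check; only the final disparity-to-depth step is shown here.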

📝 Abstract
Reconstructing surgical scenes from monocular endoscopic video is critical for advancing robotic-assisted surgery. However, the application of state-of-the-art general-purpose reconstruction models is constrained by two key challenges: the lack of supervised training data and performance degradation over long video sequences. To overcome these limitations, we propose SurgCUT3R, a systematic framework that adapts unified 3D reconstruction models to the surgical domain. Our contributions are threefold. First, we develop a data generation pipeline that exploits public stereo surgical datasets to produce large-scale, metric-scale pseudo-ground-truth depth maps, effectively bridging the data gap. Second, we propose a hybrid supervision strategy that couples our pseudo-ground-truth with geometric self-correction to enhance robustness against inherent data imperfections. Third, we introduce a hierarchical inference framework that employs two specialized models to effectively mitigate accumulated pose drift over long surgical videos: one for global stability and one for local accuracy. Experiments on the SCARED and StereoMIS datasets demonstrate that our method achieves a competitive balance between accuracy and efficiency, delivering near state-of-the-art but substantially faster pose estimation and offering a practical and effective solution for robust reconstruction in surgical environments. Project page: https://chumo-xu.github.io/SurgCUT3R-ICRA26/.
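The hybrid supervision strategy in the abstract pairs pseudo-ground-truth depth with geometric self-correction. A minimal sketch of such a loss is shown below, assuming the self-correction term penalises disagreement between the predicted depth and the prediction warped from a neighbouring view; the function name, the weighting, and the `reproj` input are hypothetical, standing in for whatever geometric consistency signal the paper actually uses.

```python
import numpy as np

def hybrid_depth_loss(pred, pseudo_gt, reproj, valid_mask, w_self=0.5):
    """Masked L1 loss to pseudo-ground-truth plus a geometric
    self-consistency term.

    `reproj` is the depth prediction reprojected from a neighbouring
    frame; agreement between `pred` and `reproj` discourages geometry
    that only fits noisy pseudo-labels.
    """
    m = np.asarray(valid_mask, dtype=bool)
    supervised = np.abs(pred[m] - pseudo_gt[m]).mean()
    self_corr = np.abs(pred[m] - reproj[m]).mean()
    return supervised + w_self * self_corr

# Toy example on a 1x2 depth map with all pixels valid.
pred = np.array([[1.0, 2.0]])
pseudo_gt = np.array([[1.5, 2.0]])
reproj = np.array([[1.0, 1.0]])
loss = hybrid_depth_loss(pred, pseudo_gt, reproj, np.ones_like(pred, dtype=bool))
```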
Problem

Research questions and friction points this paper is trying to address.

surgical scene reconstruction
monocular endoscopic video
3D reconstruction
long video sequences
supervised training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

surgical scene reconstruction
pseudo-ground-truth depth
hybrid supervision
hierarchical inference
pose drift mitigation
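The hierarchical inference idea listed above, one model for global stability and one for local accuracy, can be caricatured as anchoring short per-chunk trajectories from a local model to sparse keyframe poses from a global model, so drift cannot accumulate across chunks. The helper names and the 4×4-matrix pose convention below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def stitch_chunks(keyframe_poses, chunk_relative_poses):
    """Compose per-chunk relative poses (local model) onto keyframe
    world poses (global model).

    keyframe_poses: one 4x4 world pose per chunk start.
    chunk_relative_poses: per chunk, a list of 4x4 poses relative to
    that chunk's first frame (identity for the first frame).
    """
    trajectory = []
    for anchor, rels in zip(keyframe_poses, chunk_relative_poses):
        for rel in rels:
            trajectory.append(anchor @ rel)
    return trajectory

def T(x):
    """Hypothetical helper: pure x-translation as a 4x4 pose matrix."""
    M = np.eye(4)
    M[0, 3] = x
    return M

# Two chunks of two frames each; the second chunk is anchored 1 m along x.
traj = stitch_chunks(
    [T(0.0), T(1.0)],
    [[T(0.0), T(0.5)], [T(0.0), T(0.2)]],
)
```

Because each chunk is re-anchored to a globally estimated keyframe pose, error from the local model stays bounded within a chunk instead of compounding over the whole sequence.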
Kaiyuan Xu
The Hamlyn Centre for Robotic Surgery, Imperial College London, SW7 2AZ, UK.
Fangzhou Hong
Nanyang Technological University
3D Computer Vision
Daniel Elson
The Hamlyn Centre for Robotic Surgery, Imperial College London, SW7 2AZ, UK.
Baoru Huang
University of Liverpool; Imperial College London
Robotics · Computer Vision · Surgical Vision · Image-Guided Intervention