AI Summary
This work addresses high-fidelity 4D reconstruction of deformable surgical scenes from monocular endoscopic videos under arbitrary camera motion, a task hindered by existing methods' reliance on fixed viewpoints, stereo depth, or precise motion initialization, which makes them unsuitable for real clinical settings. To overcome these constraints, we propose Local-EndoGS, a framework that constructs local deformable 3D Gaussian Splatting models within a sliding window. It integrates coarse-to-fine robust initialization, multi-view geometric constraints, monocular depth priors, and cross-window information fusion, further enhanced by long-range pixel trajectories and physical motion priors to ensure plausible deformations. The method enables scalable 4D reconstruction under arbitrary camera trajectories without requiring stereo inputs or accurate structure-from-motion (SfM). Experiments demonstrate consistent superiority over state-of-the-art approaches on three public deformable endoscopic datasets, and ablation studies confirm the contribution of each component.
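The sliding-window idea above, where each window of frames receives its own local deformable scene model and neighbouring windows share frames for cross-window fusion, can be sketched as follows. This is an illustrative assumption only: the function name `make_windows` and the window and overlap sizes are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of the windowed allocation described above: frames are
# grouped into overlapping sliding windows, and each window would receive its
# own local deformable 3D Gaussian model. Sizes are illustrative assumptions.

def make_windows(num_frames, window_size=8, overlap=2):
    """Split frame indices into overlapping windows.

    The overlap region is where cross-window information (e.g. shared
    geometry or depth estimates) could be fused between neighbours.
    """
    stride = window_size - overlap
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + window_size, num_frames)
        windows.append(list(range(start, end)))
        if end == num_frames:
            break
        start += stride
    return windows

windows = make_windows(20, window_size=8, overlap=2)
# Consecutive windows share `overlap` frames, so the local models can be
# fused where their coverage intersects.
```

In this toy setting, 20 frames yield three windows, with each adjacent pair sharing two frames as the fusion region.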
Abstract
Reconstructing deformable surgical scenes from endoscopic videos is both challenging and clinically important. Recent state-of-the-art methods based on implicit neural representations or 3D Gaussian splatting have made notable progress. However, most are designed for deformable scenes observed from a fixed endoscope viewpoint and rely on stereo depth priors or accurate structure-from-motion (SfM) for initialization and optimization, limiting their ability to handle monocular sequences with large camera motion in real clinical settings. To address this, we propose Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. Local-EndoGS introduces a progressive, window-based global representation that allocates a local deformable scene model to each observed window, enabling scalability to long sequences with substantial motion. To overcome unreliable initialization in the absence of stereo depth or accurate SfM, we design a coarse-to-fine strategy that integrates multi-view geometry, cross-window information, and monocular depth priors, providing a robust foundation for optimization. We further incorporate long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in both appearance quality and geometric accuracy. Ablation studies validate the effectiveness of our key designs. Code will be released upon acceptance at: https://github.com/IRMVLab/Local-EndoGS.
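One common form of physical motion prior on long-range pixel trajectories is a temporal smoothness (zero-acceleration) penalty. The paper does not specify its exact losses, so the sketch below is an assumption for illustration: the function `trajectory_smoothness_loss` is hypothetical and simply penalizes the second difference of a tracked 2D trajectory.

```python
import numpy as np

def trajectory_smoothness_loss(traj):
    """Penalize acceleration along a tracked 2D pixel trajectory.

    traj: (T, 2) array of pixel positions over T frames. A constant-velocity
    (zero-acceleration) trajectory incurs zero loss; deviations from smooth
    motion are penalized quadratically. Illustrative prior, not the paper's.
    """
    # Second finite difference approximates acceleration between frames.
    accel = traj[2:] - 2.0 * traj[1:-1] + traj[:-2]
    return float(np.mean(np.sum(accel ** 2, axis=-1)))

# A straight, constant-velocity track incurs no penalty:
t = np.arange(5, dtype=float)
straight = np.stack([t, 2.0 * t], axis=-1)
trajectory_smoothness_loss(straight)  # → 0.0
```

A regularizer of this form can be summed over many long-range tracks and added to the photometric and geometric terms during optimization.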