🤖 AI Summary
This work addresses the challenge of achieving both high-fidelity detail and temporal consistency in dynamic 3D scene reconstruction. Existing Gaussian splatting methods often overfit to instantaneous states due to per-frame optimization. To overcome this limitation, we propose a multi-frame, node-guided 4D Gaussian splatting framework that models inter-frame motion dependencies within short temporal windows using a sparse set of control nodes. Each node pairs a decoupled canonical position with a latent code, forming a stable semantic anchor that mitigates correspondence drift under large motions. Combined with an input masking strategy and multi-frame consistency losses, the method trains end-to-end for high-fidelity dynamic reconstruction. Our approach achieves state-of-the-art performance in both reconstruction quality and real-time rendering speed, supporting interactive rendering of highly detailed dynamic scenes.
📝 Abstract
Reconstructing dynamic 3D scenes with photorealistic detail and strong temporal coherence remains a significant challenge. Existing Gaussian splatting approaches for dynamic scene modeling often rely on per-frame optimization, which can overfit to instantaneous states instead of capturing underlying motion dynamics. To address this, we present Mango-GS, a multi-frame, node-guided framework for high-fidelity 4D reconstruction. Mango-GS leverages a temporal Transformer to model motion dependencies within a short window of frames, producing temporally consistent deformations. For efficiency, temporal modeling is confined to a sparse set of control nodes. Each node is represented by a decoupled canonical position and a latent code, providing a stable semantic anchor for motion propagation and preventing correspondence drift under large motion. Our framework is trained end-to-end, enhanced by an input masking strategy and two multi-frame losses to improve robustness. Extensive experiments demonstrate that Mango-GS achieves state-of-the-art reconstruction quality and real-time rendering speed, enabling high-fidelity reconstruction and interactive rendering of dynamic scenes.
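To make the node-guided design concrete, the sketch below shows one common way sparse control nodes can drive dense Gaussian deformation: each node holds a decoupled canonical position and a latent code, and per-node displacements are blended onto the Gaussians with normalized RBF weights. This is a minimal NumPy illustration under stated assumptions, not Mango-GS itself: the sizes are arbitrary, the node displacements are random stand-ins for the output of the temporal Transformer, and the paper's actual blending scheme may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 5 control nodes, 100 Gaussians,
# 8-dim latent codes, a 4-frame temporal window.
M, N, D, T = 5, 100, 8, 4

# Each control node decouples a canonical 3D position from a latent code
# that serves as a stable semantic anchor across frames.
node_pos = rng.standard_normal((M, 3))      # canonical node positions
node_latent = rng.standard_normal((M, D))   # per-node latent codes

gauss_pos = rng.standard_normal((N, 3))     # canonical Gaussian centers

# Per-frame node displacements over the window. In the actual method these
# would be predicted by the temporal Transformer from the node latents;
# random values here only demonstrate the propagation step.
node_disp = rng.standard_normal((T, M, 3))

def propagate(gauss_pos, node_pos, node_disp, sigma=1.0):
    """Blend sparse node displacements onto dense Gaussians using
    normalized RBF weights (a standard node-graph deformation scheme)."""
    d2 = ((gauss_pos[:, None, :] - node_pos[None, :, :]) ** 2).sum(-1)  # (N, M)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)        # weights sum to 1 per Gaussian
    return w @ node_disp                     # (N, 3) per-Gaussian displacement

# Deformed Gaussian centers for every frame in the temporal window.
deformed = np.stack([gauss_pos + propagate(gauss_pos, node_pos, node_disp[t])
                     for t in range(T)])     # shape (T, N, 3)
```

Because the blend weights are normalized, a uniform motion of all nodes translates every Gaussian rigidly, which is the property that lets stable node anchors carry coherent motion across frames.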