🤖 AI Summary
Dynamic objects, illumination changes, and complex occlusions in real-world scenes significantly degrade the reconstruction quality of 3D Gaussian Splatting (3D-GS). To address this without relying on vision foundation models (VFMs), we propose an efficient, high-fidelity reconstruction framework. Our method introduces two key innovations: (1) a novel deformable transient field jointly optimized with superpixel-aware masks to explicitly model occlusion boundaries; and (2) an uncertainty-aware densification strategy that actively suppresses Gaussian creation in occluded regions. The entire pipeline is end-to-end differentiable and memory-efficient. Extensive evaluations on multiple benchmarks demonstrate that our approach outperforms existing VFM-free methods in static reconstruction fidelity, robustness to transient elements, and geometric accuracy, while also achieving superior computational efficiency.
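The superpixel-aware masking idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the threshold `tau`, and the assumption that superpixel labels come from an off-the-shelf algorithm (e.g., SLIC) are all hypothetical; the key point is that the transient/occluder decision is made per superpixel rather than per pixel, which yields clean mask boundaries.

```python
import numpy as np

def superpixel_aware_mask(error_map: np.ndarray, labels: np.ndarray, tau: float) -> np.ndarray:
    """Mark whole superpixels as transient when their mean photometric error exceeds tau.

    error_map: (H, W) per-pixel photometric error (e.g., |rendered - observed|).
    labels:    (H, W) integer superpixel labels (e.g., from SLIC).
    Returns a boolean (H, W) mask; True = likely occluder/transient pixel.
    """
    n = labels.max() + 1
    # Aggregate the photometric error per superpixel, then broadcast the
    # decision back to pixels: the mask snaps to superpixel boundaries
    # instead of following noisy per-pixel errors.
    sums = np.bincount(labels.ravel(), weights=error_map.ravel(), minlength=n)
    counts = np.bincount(labels.ravel(), minlength=n)
    mean_error = sums / np.maximum(counts, 1)
    return mean_error[labels] > tau
```

Because every pixel inside a superpixel receives the same decision, occluder boundaries follow image edges (which superpixels respect) rather than the speckled pattern a raw photometric-error threshold would produce.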
📄 Abstract
Recently, 3D Gaussian Splatting (3D-GS) has emerged, achieving real-time rendering speeds and high-quality results in static scenes. Although 3D-GS is effective in static scenes, its performance degrades significantly in real-world environments due to transient objects, lighting variations, and diverse levels of occlusion. To tackle this, existing methods estimate occluders or transient elements by leveraging pre-trained models or integrating additional transient field pipelines. However, these methods still suffer from two defects: (1) using semantic features from a vision foundation model (VFM) incurs additional computational cost, and (2) the transient field requires significant memory to handle transient elements with per-view Gaussians and struggles to define clear boundaries for occluders when relying solely on photometric errors. To address these problems, we propose ForestSplats, a novel approach that leverages a deformable transient field and a superpixel-aware mask to efficiently represent transient elements in the 2D scene across unconstrained image collections and effectively decompose static scenes from transient distractors without a VFM. We design the transient field to be deformable, capturing per-view transient elements. Furthermore, we introduce a superpixel-aware mask that clearly defines the boundaries of occluders by considering both photometric errors and superpixels. Additionally, we propose uncertainty-aware densification to avoid generating Gaussians within occluder boundaries during densification. Through extensive experiments across several benchmark datasets, we demonstrate that ForestSplats outperforms existing VFM-free methods and shows significant memory efficiency in representing transient elements.
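The uncertainty-aware densification step can be illustrated with a short sketch. This is an assumed, simplified form (the function name and the use of a boolean occluder mask as the "uncertainty" signal are illustrative): candidate Gaussians produced by clone/split densification are dropped whenever their projected 2D position falls inside the occluder region, so no new Gaussians are spawned to fit transient content.

```python
import numpy as np

def suppress_densification(candidates_xy: np.ndarray, occluder_mask: np.ndarray) -> np.ndarray:
    """Keep only densification candidates whose projected 2D position lies
    outside the occluder mask.

    candidates_xy: (N, 2) pixel-space (x, y) projections of candidate Gaussians.
    occluder_mask: (H, W) boolean mask; True = occluded/transient region.
    Returns a boolean (N,) keep-flag array.
    """
    H, W = occluder_mask.shape
    # Round projections to pixel indices and clamp to the image bounds.
    x = np.clip(np.round(candidates_xy[:, 0]).astype(int), 0, W - 1)
    y = np.clip(np.round(candidates_xy[:, 1]).astype(int), 0, H - 1)
    # Suppress any candidate that lands inside an occluded superpixel.
    return ~occluder_mask[y, x]
```

In practice this filter would run once per densification step, using the current view's mask, so the static-scene Gaussians are never densified to explain pixels claimed by the transient field.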