🤖 AI Summary
This work addresses the significant performance degradation of existing distractor-free 3D Gaussian Splatting methods under sparse-view conditions, which stems from their reliance on unreliable color-residual heuristics. To overcome this limitation, we propose a novel approach that integrates priors from a geometric foundation model (VGGT) and a vision-language model (VLM). Specifically, we leverage VGGT's attention maps for the first time to enable semantic entity matching, while employing the VLM to identify large static regions, thereby effectively suppressing transient distractors. Our method substantially enhances the robustness and accuracy of distractor-free 3D reconstruction from sparse inputs. Extensive experiments demonstrate its superior performance compared to existing approaches.
📝 Abstract
3D Gaussian Splatting (3DGS) enables efficient training and fast novel view synthesis in static environments. To address the challenges posed by transient objects, distractor-free 3DGS methods have emerged and shown promising results when dense image captures are available. However, their performance degrades significantly under sparse input conditions. This limitation primarily stems from the reliance on color-residual heuristics to guide training, which become unreliable with limited observations. In this work, we propose a framework to enhance distractor-free 3DGS under sparse-view conditions by incorporating rich prior information. Specifically, we first adopt the geometry foundation model VGGT to estimate camera parameters and generate a dense set of initial 3D points. Then, we harness the attention maps from VGGT for efficient and accurate semantic entity matching. Additionally, we utilize Vision-Language Models (VLMs) to identify and preserve large static regions in the scene. We also demonstrate how these priors can be seamlessly integrated into existing distractor-free 3DGS methods. Extensive experiments confirm the effectiveness and robustness of our approach in mitigating transient distractors for sparse-view 3DGS training.
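To make the prior-integration idea concrete, here is a minimal, hypothetical sketch of how a VLM-derived static-region mask could override a per-pixel color-residual heuristic when flagging transient distractors. The function, arrays, and threshold are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def distractor_mask(residual, static_region, thresh=0.5):
    """Flag pixels as transient distractors when their color residual is high,
    unless a VLM prior marks them as part of a large static region.
    (Hypothetical sketch: threshold and masking rule are assumptions.)"""
    candidate = residual > thresh        # residual heuristic, unreliable under sparse views
    return candidate & ~static_region    # static-region prior suppresses false positives

# Toy 2x2 example: one high-residual pixel lies inside a VLM-marked static region
residual = np.array([[0.9, 0.2],
                     [0.8, 0.7]])
static = np.array([[True, False],
                   [False, True]])       # VLM says these pixels belong to static structure
mask = distractor_mask(residual, static)
# Only the high-residual pixel outside the static region is kept as a distractor
```

The point of the sketch is the precedence rule: the residual heuristic proposes distractor pixels, and the VLM prior vetoes proposals that fall inside large static regions, which is where sparse-view residuals are most error-prone.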