🤖 AI Summary
This paper addresses high-fidelity 3D inpainting of occluded objects in 3D Gaussian Splatting (3DGS). To overcome the limitations of single-view cues and the artifacts induced by dynamic distractors (e.g., swimming fish), the authors propose VISTA, the first end-to-end, multi-view-guided 3D Gaussian inpainting framework. VISTA introduces a visibility-uncertainty modeling mechanism that jointly guides 3D Gaussian reconstruction and mask-free learning of the scene's semantic concept, and it couples diffusion-model-driven multi-view image inpainting with 3DGS parameter optimization. Evaluated on the SPIn-NeRF and underwater UTB180 datasets, VISTA significantly outperforms state-of-the-art methods, producing 3DGS models that are geometrically consistent, semantically plausible, seamlessly blended, and free of artifacts. Notably, it also achieves robust dynamic-object replacement together with high-quality novel-view synthesis.
📝 Abstract
3D Gaussian Splatting (3DGS) has emerged as a powerful and efficient 3D representation for novel view synthesis. This paper extends 3DGS to inpainting, where masked objects in a scene are replaced with new content that blends seamlessly with the surroundings. Unlike 2D image inpainting, 3D Gaussian inpainting (3DGI) must effectively leverage complementary visual and semantic cues from multiple input views, since areas occluded in one view may be visible in others. To address this, we propose a method that measures the visibility uncertainty of 3D points across the input views and uses it to guide 3DGI in exploiting these complementary visual cues. We also employ the uncertainties to learn a semantic concept of the scene without the masked object, and use a diffusion model to fill masked objects in the input images based on the learned concept. Finally, we build a novel 3DGI framework, VISTA, by integrating VISibility-uncerTainty-guided 3DGI with scene conceptuAl learning. VISTA generates high-quality 3DGS models capable of synthesizing artifact-free, naturally inpainted novel views. Our approach further handles dynamic distractors arising from temporal object changes, enhancing its versatility across diverse scene-reconstruction scenarios. We demonstrate superior performance over state-of-the-art techniques on two challenging datasets: the SPIn-NeRF dataset, featuring 10 diverse static 3D inpainting scenes, and an underwater 3D inpainting dataset derived from UTB180, which includes fast-moving fish as inpainting targets.
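The core idea in the abstract, scoring how uncertain each 3D point's appearance is across views, can be illustrated with a minimal sketch. This is not the paper's actual formulation: it assumes simple pinhole cameras and binary inpainting masks, and uses the fraction of views in which a point falls inside the mask (and is thus unobserved) as a hypothetical uncertainty measure; all function names here are illustrative.

```python
import numpy as np

def project(points, K, R, t):
    """Project Nx3 world points to pixel coordinates with a pinhole camera
    (intrinsics K, rotation R, translation t). Assumes points lie in front
    of the camera (positive depth)."""
    cam = points @ R.T + t            # world frame -> camera frame
    uv = cam[:, :2] / cam[:, 2:3]     # perspective divide
    return uv @ K[:2, :2].T + K[:2, 2]  # apply focal lengths and principal point

def visibility_uncertainty(points, cameras, masks):
    """Hypothetical per-point uncertainty: the fraction of input views in
    which the point projects inside that view's inpainting mask, i.e. the
    fraction of views where the point is unobserved. 0 means the point is
    seen in every view; 1 means it is hidden in all of them."""
    n_pts = len(points)
    hidden = np.zeros(n_pts)
    for (K, R, t), mask in zip(cameras, masks):
        uv = np.round(project(points, K, R, t)).astype(int)
        h, w = mask.shape
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        masked = np.zeros(n_pts, dtype=bool)
        masked[inside] = mask[uv[inside, 1], uv[inside, 0]]  # row = y, col = x
        hidden += masked
    return hidden / len(masks)
```

Such scores could then act as per-view weights: views where a point has low uncertainty supply reliable visual cues for reconstruction, while highly uncertain regions are deferred to the diffusion-based inpainting described above.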