🤖 AI Summary
Addressing challenges in editing 3D Gaussian Splatting (3DGS) scenes—including difficulty in local region editing, inaccurate semantic localization, and view inconsistency—this paper proposes an iterative editing framework guided by 2D priors. Specifically, it leverages 2D diffusion models for pixel-accurate edit region localization, integrates inverse rendering with foundation-model-predicted depth maps for 3D spatial mapping, initializes a view-consistent coarse geometry, and jointly optimizes Gaussian primitives’ geometry and appearance. By bypassing low-fidelity 3D semantic parsing and instead harnessing strong 2D priors to guide 3D editing, the method achieves high fidelity and efficiency while preserving cross-view consistency. Experiments demonstrate state-of-the-art performance across diverse scenes: editing speed improves by 4× over existing approaches, with superior detail preservation and geometric coherence.
📝 Abstract
Many 3D scene editing tasks focus on modifying local regions rather than the entire scene, apart from a few global applications such as style transfer. In 3D Gaussian Splatting (3DGS), where a scene is represented by a set of Gaussians, this structure allows for precise regional edits, offering enhanced control over specific areas of the scene. The challenge, however, is that 3D semantic parsing often underperforms its 2D counterpart, making targeted manipulation within 3D space more difficult and limiting the fidelity of edits. We address this by leveraging 2D diffusion editing to accurately identify the modification region in each view, followed by inverse rendering for 3D localization. We then refine the frontal view and initialize a coarse 3DGS with consistent views and approximate shapes derived from depth maps predicted by a 2D foundation model, thereby supporting an iterative, view-consistent editing process that gradually enhances structural details and textures to ensure coherence across perspectives. Experiments demonstrate that our method achieves state-of-the-art performance while delivering up to a $4\times$ speedup, providing a more efficient and effective approach to local 3D scene editing.
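The 3D localization step described above, where 2D edit masks are mapped into 3D via predicted depth, can be illustrated with a minimal sketch. The snippet below is not the paper's code: it shows only standard pinhole back-projection of masked pixels using a depth map, and the function name and toy values are hypothetical.

```python
import numpy as np

def unproject_mask(mask, depth, fx, fy, cx, cy):
    """Lift masked pixels to 3D camera-space points via z * K^{-1} [u, v, 1]^T.

    Illustrative only: the paper's actual 3D localization combines inverse
    rendering with foundation-model depth; this shows just the geometric core.
    """
    v, u = np.nonzero(mask)             # pixel coordinates of the 2D edit region
    z = depth[v, u]                     # per-pixel depth (e.g., foundation-model prediction)
    x = (u - cx) * z / fx               # pinhole back-projection along rays
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # (N, 3) candidate 3D edit points

# Toy example: a 4x4 depth map with a 2x2 edited patch at constant depth 2.0.
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
pts = unproject_mask(mask, depth, fx=1.0, fy=1.0, cx=2.0, cy=2.0)
print(pts.shape)  # (4, 3)
```

Aggregating such points across views yields the 3D region where Gaussian primitives are selected for editing; per-view masks keep the localization pixel-accurate even when 3D semantic parsing is unreliable.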