🤖 AI Summary
This work addresses the challenge of achieving multi-view consistency in 3D scene editing with 2D diffusion models, a task hindered by the scarcity of paired 3D-consistent editing data for supervised fine-tuning. To overcome this limitation, the authors propose RL3DEdit, a framework that, for the first time, integrates reinforcement learning with the 3D foundation model VGGT to enable high-quality single-pass 3D editing without paired training data. RL3DEdit leverages VGGT's geometric priors to construct a verifiable 3D-consistency reward signal that guides the 2D diffusion model toward geometrically coherent edits. Experiments show that RL3DEdit significantly outperforms existing methods in both multi-view consistency and overall editing quality, offering an efficient and stable solution for 3D scene editing. The code and models will be publicly released.
📝 Abstract
Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in the edited results remains challenging, and the extreme scarcity of paired 3D-consistent editing data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that while generating multi-view-consistent 3D content is highly challenging, verifying 3D consistency is tractable, which naturally positions reinforcement learning (RL) as a feasible solution. Motivated by this, we propose **RL3DEdit**, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model VGGT. Specifically, we exploit VGGT's robust priors learned from massive real-world data: we feed the edited images into VGGT and use its output confidence maps and pose-estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
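The abstract describes combining two VGGT outputs into a verifiable reward: per-pixel confidence over the edited views and the error of the poses re-estimated from them. A minimal sketch of such a reward is shown below; the aggregation (mean confidence minus a weighted mean pose error) and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a 3D-consistency reward in the spirit of RL3DEdit.
# The exact combination of confidence maps and pose errors is an assumption.
import numpy as np

def consistency_reward(confidence_maps, pose_errors, lam=1.0):
    """Score a set of edited views for 3D consistency.

    confidence_maps: list of 2D arrays in [0, 1], one per edited view,
                     as produced by a VGGT-style geometry model.
    pose_errors: list of scalar errors between the camera poses
                 re-estimated from the edited views and the originals.
    lam: weight trading off pose accuracy against confidence (assumed).
    """
    conf_term = float(np.mean([c.mean() for c in confidence_maps]))
    pose_term = float(np.mean(pose_errors))
    # Higher confidence and lower pose error => higher reward,
    # so the RL objective pushes edits toward the 3D-consistent manifold.
    return conf_term - lam * pose_term
```

In an RL loop, this scalar would score each batch of edited views and drive a policy-gradient update of the 2D diffusion editor.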