🤖 AI Summary
This work addresses the challenge of achieving multi-view consistency in 3D scene editing with 2D diffusion models, a task hindered by the scarcity of paired 3D-consistent editing data for supervised fine-tuning. To overcome this limitation, the authors propose RL3DEdit, a framework that, for the first time, integrates reinforcement learning with the 3D foundation model VGGT to enable high-quality single-pass 3D editing without paired training data. RL3DEdit leverages VGGT's geometric priors to construct a verifiable 3D-consistency reward signal that guides the 2D diffusion model toward geometrically coherent edits. Experiments show that RL3DEdit significantly outperforms existing methods in both multi-view consistency and overall editing quality, offering an efficient and stable solution for 3D scene editing. The code and models will be publicly released.
📝 Abstract
Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in the edited results remains challenging, and the extreme scarcity of paired 3D-consistent editing data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that while generating multi-view-consistent 3D content is highly challenging, verifying 3D consistency is tractable, which naturally positions reinforcement learning (RL) as a feasible solution. Motivated by this, we propose **RL3DEdit**, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model VGGT. Specifically, we exploit VGGT's robust priors learned from massive real-world data: we feed the edited images into VGGT and use its output confidence maps and pose-estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
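The abstract describes combining two VGGT outputs into a verifiable reward: per-pixel confidence over the edited views and the error of the poses re-estimated from them. A minimal sketch of such a reward is shown below; the aggregation (mean confidence minus a weighted mean pose error) and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a 3D-consistency reward in the spirit of RL3DEdit.
# The exact combination of confidence maps and pose errors is an assumption.
import numpy as np

def consistency_reward(confidence_maps, pose_errors, lam=1.0):
    """Score a set of edited views for 3D consistency.

    confidence_maps: list of 2D arrays in [0, 1], one per edited view,
                     as produced by a VGGT-style geometry model.
    pose_errors: list of scalar errors between the camera poses
                 re-estimated from the edited views and the originals.
    lam: weight trading off pose accuracy against confidence (assumed).
    """
    conf_term = float(np.mean([c.mean() for c in confidence_maps]))
    pose_term = float(np.mean(pose_errors))
    # Higher confidence and lower pose error => higher reward,
    # so the RL objective pushes edits toward the 3D-consistent manifold.
    return conf_term - lam * pose_term
```

In an RL loop, this scalar would score each batch of edited views and drive a policy-gradient update of the 2D diffusion editor.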