DisCo3D: Distilling Multi-View Consistency for 3D Scene Editing

📅 2025-08-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the core challenge of preserving multi-view consistency in 3D scene editing. The authors propose a consistency-aware knowledge distillation framework that explicitly transfers multi-view 3D priors (derived from a fine-tuned multi-view 3D generator and Gaussian splatting reconstruction) into a 2D diffusion-based editor, so that the 2D editing process implicitly respects 3D geometric constraints. Unlike prior approaches, the method requires neither iterative optimization nor explicit 3D representations during editing, significantly improving editing efficiency and cross-view consistency while mitigating structural misalignment and texture distortion in complex scenes. Extensive experiments demonstrate that the approach outperforms existing state-of-the-art methods in editing fidelity, multi-view consistency, and computational efficiency, establishing a paradigm for high-quality, single-step, 3D-aware image editing.
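The summary does not state the distillation objective itself; one plausible general form (notation is ours, not the paper's) regresses the 2D editor's per-view noise prediction toward that of the multi-view teacher:

$$
\mathcal{L}_{\text{distill}} = \mathbb{E}_{t,v}\Big[\big\lVert \epsilon_{\theta}(z_t^{v}, t, c) - \epsilon_{\psi}\big(z_t^{v}, t, c, \{z^{w}\}_{w \neq v}\big)\big\rVert_2^2\Big]
$$

where $\epsilon_{\theta}$ is the student 2D editor, $\epsilon_{\psi}$ is the scene-adapted multi-view teacher conditioned on the remaining views $\{z^{w}\}_{w \neq v}$, $z_t^{v}$ is a noisy latent of view $v$ at timestep $t$, and $c$ is the editing condition.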

📝 Abstract
While diffusion models have demonstrated remarkable progress in 2D image generation and editing, extending these capabilities to 3D editing remains challenging, particularly in maintaining multi-view consistency. Classical approaches typically update 3D representations through iterative refinement based on a single editing view. However, these methods often suffer from slow convergence and blurry artifacts caused by cross-view inconsistencies. Recent methods improve efficiency by propagating 2D editing attention features, yet still exhibit fine-grained inconsistencies and failure modes in complex scenes due to insufficient constraints. To address this, we propose DisCo3D, a novel framework that distills 3D consistency priors into a 2D editor. Our method first fine-tunes a 3D generator using multi-view inputs for scene adaptation, then trains a 2D editor through consistency distillation. The edited multi-view outputs are finally optimized into 3D representations via Gaussian Splatting. Experimental results show DisCo3D achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality.
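To make the consistency-distillation stage concrete, below is a minimal runnable sketch of such a training loop, assuming a simple MSE match between student and teacher noise predictions. The tiny placeholder networks and the exact loss are our assumptions, not the authors' implementation; in the paper the teacher would be the scene-adapted multi-view 3D generator and the student the 2D diffusion editor.

```python
# Minimal runnable sketch of a consistency-distillation loop (assumption, not the
# authors' code). Teacher/student are tiny stand-in networks, not diffusion models.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Placeholder denoiser: maps a noisy per-view latent to a noise prediction."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, z_t):
        return self.net(z_t)

teacher = TinyDenoiser()   # stand-in for the scene-adapted multi-view 3D generator
student = TinyDenoiser()   # stand-in for the 2D diffusion editor being trained
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

for step in range(100):
    z_t = torch.randn(8, 64)            # batch of noisy per-view latents (dummy data)
    with torch.no_grad():
        target = teacher(z_t)           # teacher prediction carries the 3D prior
    loss = nn.functional.mse_loss(student(z_t), target)  # consistency distillation
    opt.zero_grad()
    loss.backward()
    opt.step()
```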
Problem

Research questions and friction points this paper is trying to address.

Extending 2D diffusion models to 3D scene editing
Maintaining multi-view consistency in 3D editing
Reducing slow convergence and blurry artifacts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distilling 3D consistency priors into 2D editor
Fine-tuning 3D generator with multi-view inputs
Optimizing edited outputs into 3D via Gaussian Splatting (see the toy sketch after this list)
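The final stage fits a 3D representation to the edited views by photometric optimization. As a hedged illustration, the toy below optimizes 2D Gaussians against a single target image; real 3D Gaussian Splatting projects anisotropic 3D Gaussians through camera poses and alpha-composites them, so this only demonstrates the shape of the optimization loop, not the authors' pipeline.

```python
# Toy 2D analogue of Gaussian-splatting-style fitting (illustrative assumption).
import torch

H = W = 32
N = 50                                  # number of Gaussians
target = torch.rand(H, W, 3)            # stand-in for one edited view

means = torch.rand(N, 2, requires_grad=True)      # normalized (x, y) centers
log_scales = torch.zeros(N, requires_grad=True)   # isotropic scales (log-space)
colors = torch.rand(N, 3, requires_grad=True)     # per-Gaussian RGB

ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (H*W, 2) pixel coordinates

opt = torch.optim.Adam([means, log_scales, colors], lr=1e-2)
for step in range(300):
    d2 = ((grid[:, None, :] - means[None]) ** 2).sum(-1)            # (H*W, N)
    weights = torch.exp(-d2 / (2 * torch.exp(log_scales) ** 2 + 1e-6))
    img = (weights @ colors) / (weights.sum(-1, keepdim=True) + 1e-6)
    loss = ((img.reshape(H, W, 3) - target) ** 2).mean()            # photometric loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```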
👥 Authors
Yufeng Chi
University of California, Berkeley
Robotics, Computer Architecture, Reinforcement Learning

Huimin Ma
Associate Professor, Department of Electronic Engineering, Tsinghua University

Kafeng Wang
Tsinghua University
Machine Learning, Deep Learning

Jianmin Li
Institute for Artificial Intelligence, Beijing National Research Center for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China