Fast Multi-view Consistent 3D Editing with Video Priors

📅 2025-11-28

📈 Citations: 0

✨ Influential: 0

career value

184K/year

🤖 AI Summary

Existing text-driven 3D editing methods rely on iterative 2D–3D–2D optimization, lacking explicit multi-view consistency priors, resulting in high computational cost and over-smoothed outputs. This paper introduces the first single-forward 3D editing framework leveraging pre-trained video diffusion models, which inherently encode temporal consistency as a geometric and dynamic prior across views. Our approach eliminates iterative optimization by directly mapping text prompts to consistent 3D geometry in one forward pass. Key contributions include: (1) a motion-preserving cross-view noise fusion mechanism that enforces dynamic consistency across viewpoints; and (2) a geometry-aware 3D denoising module that explicitly constrains surface fidelity during diffusion. Experiments demonstrate that our method generates high-fidelity, multi-view-consistent 3D edits in a single forward pass—achieving superior quality and efficiency compared to state-of-the-art iterative approaches.

Technology Category

Application Category

📝 Abstract

Text-driven 3D editing enables user-friendly 3D object or scene editing with text instructions. Due to the lack of multi-view consistency priors, existing methods typically resort to employing 2D generation or editing models to process each view individually, followed by iterative 2D-3D-2D updating. However, these methods are not only time-consuming but also prone to over-smoothed results because the different editing signals gathered from different views are averaged during the iterative process. In this paper, we propose generative Video Prior based 3D Editing (ViP3DE) to employ the temporal consistency priors from pre-trained video generation models for multi-view consistent 3D editing in a single forward pass. Our key insight is to condition the video generation model on a single edited view to generate other consistent edited views for 3D updating directly, thereby bypassing the iterative editing paradigm. Since 3D updating requires edited views to be paired with specific camera poses, we propose motion-preserved noise blending for the video model to generate edited views at predefined camera poses. In addition, we introduce geometry-aware denoising to further enhance multi-view consistency by integrating 3D geometric priors into video models. Extensive experiments demonstrate that our proposed ViP3DE can achieve high-quality 3D editing results even within a single forward pass, significantly outperforming existing methods in both editing quality and speed.

Problem

Research questions and friction points this paper is trying to address.

Achieving multi-view consistency in 3D editing without iterative optimization

Overcoming over-smoothed results from averaging inconsistent view edits

Generating pose-aligned edited views using video temporal consistency priors

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses video generation models for multi-view consistency

Employs motion-preserved noise blending for camera poses

Integrates geometry-aware denoising with 3D priors

🔎 Similar Papers

View-Consistent 3D Editing with Gaussian Splatting