🤖 AI Summary
Existing sparse-view 3D editing methods rely on iterative optimization at test time, which makes them computationally expensive, scene-specific, and prone to cross-view inconsistency. This work proposes a feed-forward 3D editing framework that removes per-scene optimization at test time by applying cross-view image-domain regularization and geometric alignment constraints during training. The model combines text-guided editing with joint multi-view supervision, then lifts the edited views into a 3D Gaussian Splatting (3DGS) representation in a single forward pass. Compared to optimization-based methods, the approach reports substantially improved cross-view consistency, competitive editing fidelity, and inference time reduced by orders of magnitude.
📝 Abstract
Recent advances in text-guided image editing and 3D Gaussian Splatting (3DGS) have enabled high-quality 3D scene manipulation. However, existing pipelines rely on iterative edit-and-fit optimization at test time, alternating between 2D diffusion editing and 3D reconstruction. This process is computationally expensive, scene-specific, and prone to cross-view inconsistencies.
We propose a feed-forward framework for cross-view consistent 3D scene editing from sparse views. Instead of enforcing consistency through iterative 3D refinement, we introduce a cross-view regularization scheme in the image domain during training. By jointly supervising multi-view edits with geometric alignment constraints, our model produces view-consistent results without per-scene optimization at inference. The edited views are then lifted into 3D via a feed-forward 3DGS model, yielding a coherent 3DGS representation in a single forward pass.
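The abstract does not specify the form of the cross-view regularizer. As a rough illustration only, one common way to express an image-domain consistency term with geometric alignment is to reproject the pixels of one edited view into another and penalize the color difference. The sketch below assumes a per-pixel depth map, shared pinhole intrinsics `K`, and a relative pose (`R_ab`, `t_ab`); these inputs, the function name, and the nearest-neighbor sampling are all hypothetical choices and are not taken from the paper.

```python
import numpy as np


def cross_view_consistency_loss(img_a, img_b, depth_a, K, R_ab, t_ab):
    """Mean L1 difference between view A and view B at reprojected pixels.

    Hypothetical sketch: img_a and img_b are (H, W, C) edited views,
    depth_a is an (H, W) depth map for view A, K is a 3x3 pinhole intrinsic
    matrix shared by both views, and (R_ab, t_ab) maps points from camera
    A's frame into camera B's frame.
    """
    h, w = depth_a.shape
    # Pixel grid of view A in homogeneous coordinates, shape (3, H*W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)], axis=0).astype(np.float64)
    # Unproject every pixel to a 3D point in camera A's frame.
    pts_a = (np.linalg.inv(K) @ pix) * depth_a.ravel()
    # Move the points into camera B's frame and project them.
    pts_b = R_ab @ pts_a + np.asarray(t_ab, dtype=np.float64).reshape(3, 1)
    z = pts_b[2]
    in_front = z > 1e-6
    proj = K @ pts_b
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero or negative depth
    ub = np.rint(proj[0] / safe_z).astype(int)
    vb = np.rint(proj[1] / safe_z).astype(int)
    # Keep only points that land inside view B's image bounds.
    valid = in_front & (ub >= 0) & (ub < w) & (vb >= 0) & (vb < h)
    if not valid.any():
        return 0.0
    colors_a = img_a.reshape(h * w, -1)[valid]
    # Nearest-neighbor lookup in view B (a simplification of bilinear sampling).
    colors_b = img_b[vb[valid], ub[valid]].reshape(-1, colors_a.shape[1])
    return float(np.abs(colors_a - colors_b).mean())
```

In a training setup, a term like this would typically be written in a differentiable framework with bilinear sampling and added to the editing loss, so that the edited views are pushed to agree wherever they observe the same surface point. The NumPy version here only shows the geometry of the comparison.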
Experiments demonstrate competitive editing fidelity and substantially improved cross-view consistency compared to optimization-based methods, while reducing inference time by orders of magnitude.