AI Summary
Pretrained 2D image editing models suffer from inter-view inconsistency in multi-view editing. Existing explicit 3D optimization-based approaches incur high computational overhead and exhibit instability under sparse-view conditions. This paper proposes a training-free, plug-and-play framework for inference-time multi-view consistent editing. By coupling diffusion sampling across views, it jointly models the multi-view image distribution and editing objectives, imposing implicit 3D regularization on pretrained 2D editors: synchronized cross-view sampling acts as a geometric consistency constraint, thereby bypassing explicit 3D reconstruction. The method requires only a single forward sampling pass, significantly improving both geometric coherence and visual fidelity of edited results. We validate its generality and architecture-agnosticism across three distinct multi-view editing tasks. Our approach establishes a new paradigm for efficient and robust 3D-aware image editing.
Abstract
We present an inference-time diffusion sampling method to perform multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, but they suffer from a lengthy optimization process and instability under sparse-view settings. We propose an implicit 3D regularization approach that constrains the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories, one from a multi-view image distribution and one from a 2D edited image distribution, using a coupling term to enforce multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across various model architectures and highlighting its potential as a general solution for multi-view consistent editing.
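The coupled sampling idea can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the two "denoisers" are stand-ins for the pretrained multi-view model and the 2D editing model, and the linear coupling term, schedule, and `lam` weight are assumptions made for the sketch.

```python
import numpy as np

def toy_denoiser_mv(x, t):
    # Stand-in for the pretrained multi-view model's denoised estimate:
    # here it simply pulls each view toward the cross-view mean.
    return x - 0.5 * (x - x.mean(axis=0, keepdims=True))

def toy_denoiser_edit(x, t):
    # Stand-in for the 2D editing model's denoised estimate:
    # here it pulls each view independently toward an "edited" target of 1.0.
    return x - 0.5 * (x - 1.0)

def coupled_sampling(x_mv, x_edit, steps=50, lam=0.1, seed=0):
    """Run two diffusion trajectories in lockstep; a coupling term pulls
    each trajectory toward the other so the edited views inherit the
    multi-view consistency of the multi-view trajectory (toy sketch)."""
    rng = np.random.default_rng(seed)
    for i in range(steps):
        t = 1.0 - i / steps  # toy time schedule, 1 -> 0
        noise = rng.normal(scale=0.01, size=x_mv.shape)
        # Each trajectory takes its own denoising step
        # (shared noise is a simplification of this sketch).
        x_mv = toy_denoiser_mv(x_mv, t) + noise
        x_edit = toy_denoiser_edit(x_edit, t) + noise
        # Coupling term: nudge each trajectory toward the other.
        x_mv, x_edit = (x_mv + lam * (x_edit - x_mv),
                        x_edit + lam * (x_mv - x_edit))
    return x_mv, x_edit

# Four "views", each a 3-pixel toy image, all starting from zeros.
views = np.zeros((4, 3))
mv, edited = coupled_sampling(views.copy(), views.copy())
```

In this toy setup the coupled fixed point drives both trajectories toward the edit target while the multi-view denoiser keeps the views agreeing with one another, mirroring how the coupling term transfers multi-view consistency onto the edited samples.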