AI Summary
Existing image editing methods rely on costly training or on diffusion inversion to incorporate visual context, which limits both the consistency and flexibility of edits. This work proposes VicoEdit, the first framework to enable visual context injection without additional training or diffusion inversion. By fusing the visual context of a context image directly into a pretrained text-guided editing model and employing a concept-aligned guidance strategy during posterior sampling, VicoEdit substantially improves the consistency and controllability of editing outcomes. Experimental results demonstrate that VicoEdit outperforms state-of-the-art training-based approaches across multiple metrics, yielding higher-quality and more consistent edits.
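To make the injection idea concrete, below is a minimal, hypothetical sketch of one common training-free way to fuse visual context into a pretrained model: concatenating keys and values derived from the context image into the editor's self-attention. This is an illustrative pattern under assumed toy shapes, not VicoEdit's published mechanism; `attention_with_context` and all tensors are hypothetical.

```python
# Hypothetical sketch: injecting visual-context features into a pretrained
# model's self-attention by concatenating context keys/values. A common
# training-free pattern, not necessarily VicoEdit's exact mechanism.
import torch
import torch.nn.functional as F

def attention_with_context(q, k, v, k_ctx, v_ctx):
    """Source queries attend over both source and context keys/values."""
    k_all = torch.cat([k, k_ctx], dim=1)  # (B, N_src + N_ctx, D)
    v_all = torch.cat([v, v_ctx], dim=1)
    scores = q @ k_all.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v_all  # (B, N_src, D)

# Toy shapes: batch 1, 16 source tokens, 16 context tokens, dim 32.
q = torch.rand(1, 16, 32)
k, v = torch.rand(1, 16, 32), torch.rand(1, 16, 32)
k_ctx, v_ctx = torch.rand(1, 16, 32), torch.rand(1, 16, 32)
out = attention_with_context(q, k, v, k_ctx, v_ctx)
print(out.shape)  # torch.Size([1, 16, 32])
```

Because the context only enters through extra keys and values, the pretrained attention weights are reused unchanged, which is what makes such injection schemes training-free.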
Abstract
In image editing, it is essential to incorporate a context image that conveys the user's precise requirements, such as subject appearance or image style. Existing training-based visual context-aware editing methods incur substantial data-collection effort and training cost. On the other hand, the training-free alternatives are typically built on diffusion inversion, which struggles with consistency and flexibility. In this work, we propose VicoEdit, a training-free and inversion-free method that injects the visual context into a pretrained text-prompted editing model. More specifically, VicoEdit directly transforms the source image into the target based on the visual context, thereby eliminating the need for inversion, which can lead to deviated trajectories. Moreover, we design a posterior sampling approach guided by concept alignment to further enhance editing consistency. Empirical results demonstrate that our training-free method achieves even better editing performance than state-of-the-art training-based models.
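As an illustration of guidance-based posterior sampling in this spirit, the sketch below adds a concept-alignment gradient to a toy DDIM-style denoising update, starting directly from the source image rather than from an inverted latent. `ToyEditor`, `ToyEncoder`, the noise schedule, and the cosine loss are all assumptions made for illustration, not the paper's actual components.

```python
# Hypothetical sketch of concept-alignment-guided posterior sampling: at each
# step, the gradient of an alignment loss between the predicted clean image
# and the visual context nudges the sample. ToyEditor/ToyEncoder stand in for
# the pretrained editing model and a frozen feature encoder.
import torch
import torch.nn.functional as F

class ToyEditor(torch.nn.Module):
    """Stand-in for the pretrained text-prompted editor (noise predictor)."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x_t, t):
        return self.net(x_t)

class ToyEncoder(torch.nn.Module):
    """Stand-in for a frozen encoder used to measure concept alignment."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(3, 8, 4, stride=4)
    def forward(self, x):
        return self.net(x).flatten(1)

def guided_edit(source, context, steps=20, scale=0.5):
    editor, encoder = ToyEditor(), ToyEncoder().eval()
    alphas = torch.linspace(0.5, 0.999, steps + 1)  # toy schedule
    ctx_feat = encoder(context).detach()
    x = source.clone()  # start from the source image; no inversion pass
    for t in range(steps):
        x = x.detach().requires_grad_(True)
        a, a_next = alphas[t], alphas[t + 1]
        eps = editor(x, t)
        x0_hat = (x - (1 - a).sqrt() * eps) / a.sqrt()  # predicted clean image
        # Concept-alignment guidance: cosine loss between the prediction's
        # features and the visual-context features.
        loss = 1 - F.cosine_similarity(encoder(x0_hat), ctx_feat).mean()
        grad = torch.autograd.grad(loss, x)[0]
        # DDIM-style transition plus a guidance step on the sample.
        x = (a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps
             - scale * grad).detach()
    return x

edited = guided_edit(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(edited.shape)  # torch.Size([1, 3, 64, 64])
```

Applying the guidance gradient inside the sampling loop, rather than fine-tuning the model, is what keeps this style of approach training-free.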