🤖 AI Summary
Existing text-guided image editing methods based on diffusion models suffer from substantial approximation errors during image inversion, because the intermediate generative steps lack exact supervision, which compromises both editing fidelity and text-image alignment. To address this, we propose EditInfinity, a framework that, for the first time, adapts binary-quantized generative modeling (specifically, Infinity, a binary-quantized generative model) to text-guided editing, enabling exact supervision of the intermediate quantized representations during inversion. EditInfinity couples an efficient image inversion mechanism, which integrates text prompting rectification and image style preservation, with a holistic smoothing strategy, achieving high-fidelity editing with minimal parameter overhead. On the PIE-Bench benchmark, across "add", "change", and "delete" editing operations, EditInfinity consistently outperforms state-of-the-art diffusion-based methods in both visual fidelity to the source image and semantic alignment with the target prompts.
📝 Abstract
Adapting pretrained diffusion-based generative models for text-driven image editing with negligible tuning overhead has demonstrated remarkable potential. A classical adaptation paradigm followed by these methods first infers the generative trajectory of a given source image through image inversion, then performs editing along the inferred trajectory guided by the target text prompts. However, editing performance is heavily limited by the approximation errors that diffusion models introduce during image inversion, which arise from the absence of exact supervision over the intermediate generative steps. To circumvent this issue, we investigate the parameter-efficient adaptation of VQ-based generative models for image editing, leveraging their inherent property that the exact intermediate quantized representations of a source image are attainable, which enables more effective supervision for precise image inversion. Specifically, we propose EditInfinity, which adapts Infinity, a binary-quantized generative model, for image editing. We devise an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation, enabling precise image inversion. Furthermore, we introduce a holistic smoothing strategy that allows EditInfinity to perform image editing with high fidelity to the source images and precise semantic alignment to the text prompts. Extensive experiments on the PIE-Bench benchmark across "add", "change", and "delete" editing operations demonstrate the superior performance of our model compared to state-of-the-art diffusion-based baselines. Code available at: https://github.com/yx-chen-ust/EditInfinity.
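To make the contrast with diffusion inversion concrete, below is a minimal, hypothetical sketch (not the released EditInfinity code; `ToyBinaryTokenizer` and its methods are illustrative placeholders) of why binary quantization yields exact inversion targets: re-encoding the source image deterministically reproduces the identical binary codes, so those codes can serve as exact supervision, whereas diffusion inversion can only approximate its latent trajectory.

```python
# Hypothetical sketch, assuming a toy encoder/decoder pair; this is NOT
# the authors' implementation of Infinity or EditInfinity.
import torch
import torch.nn as nn


class ToyBinaryTokenizer(nn.Module):
    """Toy binary quantizer: latents are deterministically mapped to {-1, +1}."""

    def __init__(self, channels: int = 16):
        super().__init__()
        self.encoder = nn.Conv2d(3, channels, kernel_size=4, stride=4)
        self.decoder = nn.ConvTranspose2d(channels, 3, kernel_size=4, stride=4)

    @torch.no_grad()
    def encode_to_bits(self, image: torch.Tensor) -> torch.Tensor:
        # Deterministic binarization: each latent entry becomes -1 or +1.
        z = self.encoder(image)
        return torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))

    @torch.no_grad()
    def decode_from_bits(self, bits: torch.Tensor) -> torch.Tensor:
        return self.decoder(bits)


tokenizer = ToyBinaryTokenizer()
source = torch.rand(1, 3, 64, 64)

# The exact intermediate quantized representation of the source image.
bits = tokenizer.encode_to_bits(source)

# Re-encoding reproduces the same codes bit-for-bit, so they can act as an
# exact target when tuning the inversion path; diffusion inversion has no
# such exactly recoverable intermediate state.
assert torch.equal(bits, tokenizer.encode_to_bits(source))
```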