ProEdit: Inversion-based Editing From Prompts Done Right

📅 2025-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing inversion-based visual editing methods depend excessively on the source image during sampling, hindering precise text-guided manipulation of target attributes such as pose, quantity, and color. To address this, the paper proposes (1) KV-mix attention, which fuses source and target key-value features in the edited region to suppress source-image interference, and (2) Latents-Shift, a region-aware perturbation applied to the inverted source latent to improve edit controllability. Together, these techniques balance editing fidelity and semantic controllability. The approach is plug-and-play with mainstream frameworks including RF-Solver, FireFlow, and UniEdit. Evaluated on multiple image and video editing benchmarks, it achieves state-of-the-art performance, significantly improving attribute-editing accuracy and background consistency.

📝 Abstract
Inversion-based visual editing provides an effective and training-free way to edit an image or a video based on user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy overly relies on source information, which negatively affects the edits in the target image (e.g., failing to change the subject's attributes like pose, number, or color as instructed). In this work, we propose ProEdit to address this issue in both the attention and latent aspects. In the attention aspect, we introduce KV-mix, which mixes KV features of the source and the target in the edited region, mitigating the influence of the source image on the editing region while maintaining background consistency. In the latent aspect, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on the sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves SOTA performance. In addition, our design is plug-and-play and can be seamlessly integrated into existing inversion and editing methods, such as RF-Solver, FireFlow, and UniEdit.
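The KV-mix idea from the abstract — blending source and target key-value features only inside the edited region while keeping source KV for the background — can be sketched as a masked attention step. The function name, the token-level `edit_mask`, and the blend weight `alpha` are illustrative assumptions, not the paper's actual interface:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kv_mix_attention(q_tgt, k_src, v_src, k_tgt, v_tgt, edit_mask, alpha=0.5):
    """Attention with region-masked KV mixing (illustrative sketch).

    q_tgt, k_*, v_*: (tokens, dim) arrays.
    edit_mask: (tokens,) boolean, True for tokens in the edited region.
    alpha: blend weight toward target features inside the edited region
           (a hypothetical knob; the paper's exact mixing rule may differ).
    Outside the mask, pure source K/V are kept to preserve the background.
    """
    m = edit_mask[:, None].astype(q_tgt.dtype)
    k = (1 - m) * k_src + m * ((1 - alpha) * k_src + alpha * k_tgt)
    v = (1 - m) * v_src + m * ((1 - alpha) * v_src + alpha * v_tgt)
    d = q_tgt.shape[-1]
    attn = softmax(q_tgt @ k.T / np.sqrt(d))
    return attn @ v
```

With `alpha=0` this reduces to the usual source-KV injection everywhere, which is exactly the over-reliance the paper identifies; raising `alpha` in the masked region lets target features drive the edit.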
Problem

Research questions and friction points this paper is trying to address.

Addresses over-reliance on source image information in inversion-based editing
Mitigates source influence on target edits while maintaining background consistency
Eliminates inverted latent influence on sampling for improved attribute changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-mix blends source and target features for editing
Latents-Shift perturbs source latent to reduce influence
Plug-and-play design integrates into existing inversion methods
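The Latents-Shift contribution — perturbing only the edited region of the inverted source latent so that sampling there is not anchored to the source — can be sketched as a masked noise injection. The spatial mask shape, the `strength` parameter, and the variance-preserving mix are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def latents_shift(z_src, edit_mask, strength=0.5, seed=0):
    """Region-aware perturbation of an inverted latent (illustrative sketch).

    z_src: (C, H, W) inverted source latent.
    edit_mask: (H, W) boolean, True where the edit should happen.
    strength: how strongly to push masked latents toward fresh noise
              (hypothetical knob; 0 leaves the latent untouched).
    Background positions are returned unchanged, preserving consistency.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(z_src.shape).astype(z_src.dtype)
    m = edit_mask[None].astype(z_src.dtype)  # broadcast over channels
    shifted = np.sqrt(1.0 - strength**2) * z_src + strength * noise
    return (1 - m) * z_src + m * shifted
```

Because only the masked region is perturbed, the background keeps the exact inverted latent (and hence reconstructs faithfully), while the edited region starts from a latent that no longer encodes the source attribute being changed.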