EditTransfer++: Toward Faithful and Efficient Visual-Prompt-Guided Image Editing

📅 2026-05-08

🤖 AI Summary

Existing diffusion-transformer-based methods for visual-prompt-guided image editing are hindered by a bias toward textual conditioning and by sampling stochasticity, which makes it difficult to faithfully reproduce example-based edits. This work proposes a text-decoupled fine-tuning strategy that compels the model to learn transformations solely from visual prompts, reducing its reliance on textual cues. To improve editing consistency across random seeds, it adds a best-worst contrastive refinement mechanism, and to accelerate inference it adds a condition token compression and reuse technique. The approach keeps text guidance available as an option at inference while improving both editing fidelity and generation efficiency at high resolution (1024-pixel long edge). It reports state-of-the-art visual prompt faithfulness on existing benchmarks and on EditTransfer-Bench, a newly introduced evaluation suite.

📝 Abstract

Visual-prompt-guided edit transfer aims to learn image transformations directly from example pairs, offering more precise and controllable editing than purely text-driven approaches. However, existing diffusion transformer-based methods often fail to faithfully reproduce the demonstrated edits due to structural mismatches between the task and the backbone, including a pretrained bias toward textual conditioning and inherent stochastic instability during sampling. To bridge this gap, we present EditTransfer++, a framework that combines progressively structured training with an efficient conditioning scheme to improve both visual prompt faithfulness and inference efficiency. We first mitigate textual dominance with a text-decoupled training strategy that removes text conditioning during fine-tuning, compelling the model to infer transformations solely from visual evidence while still supporting optional text guidance at inference. On top of this visually grounded model, a best-worst contrastive refinement mechanism reshapes the denoising trajectories to suppress unfaithful generations and improve consistency across random seeds. To alleviate the computational bottleneck of high-resolution in-context editing, we further introduce a condition compression and reuse strategy that reduces token redundancy and enables efficient generation of images with a 1024-pixel long edge. Extensive experiments on existing benchmarks and the proposed EditTransfer-Bench show that EditTransfer++ achieves state-of-the-art visual prompt faithfulness with substantially faster inference than prior methods, suggesting a promising direction for scalable prompt-guided image editing and broader visual in-context learning.
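The abstract does not say how the best and worst generations are chosen for contrastive refinement. As a rough illustration only, the sketch below assumes each candidate generated from a different random seed has already been given a faithfulness score by some external scorer, and simply picks the highest- and lowest-scoring candidates as the pair to contrast. The names `Candidate`, `score`, and `select_best_worst`, and the scores themselves, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    seed: int      # random seed used to sample this candidate
    score: float   # hypothetical faithfulness score, higher is better


def select_best_worst(candidates):
    """Return (best, worst) by score; an assumed pairing rule, not the paper's."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    best = max(candidates, key=lambda c: c.score)
    worst = min(candidates, key=lambda c: c.score)
    return best, worst


# Made-up scores for four seeds, purely for illustration.
samples = [
    Candidate(seed=0, score=0.62),
    Candidate(seed=1, score=0.91),
    Candidate(seed=2, score=0.35),
    Candidate(seed=3, score=0.78),
]
best, worst = select_best_worst(samples)
print(best.seed, worst.seed)
```

In a refinement stage of this kind, the pair would presumably feed a contrastive or preference-style loss that favors the best sample over the worst; the exact objective is not described in the abstract.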

Problem

Research questions and friction points this paper is trying to address.

visual-prompt-guided editing

edit transfer

diffusion transformer

faithfulness

in-context image editing

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual-prompt-guided editing

text-decoupled training

contrastive refinement