🤖 AI Summary
Current single-step image editing methods struggle with ambiguous user intent, complex transformations, and scenarios requiring iterative refinement, often yielding inconsistent results. This paper introduces the first consistency-aware framework for multi-round interactive image editing. Our method addresses these challenges via four key innovations: (1) flow-matching-based precise image inversion to ensure high-fidelity initial edits; (2) an adaptive attention highlighting mechanism that dynamically localizes editable regions; (3) a dual-objective linear quadratic regulator (LQR)-guided stable sampling strategy that explicitly models and suppresses error accumulation across rounds; and (4) attention modulation informed by Transformer layer-role analysis to enhance cross-round semantic consistency. Extensive experiments demonstrate that our approach significantly outperforms single-step baselines in both multi-round editing success rate and visual fidelity, establishing a new paradigm for iterative, controllable, and highly consistent interactive image editing.
📝 Abstract
Many real-world applications, such as interactive photo retouching, artistic content creation, and product design, require flexible and iterative image editing. However, existing image editing methods primarily focus on achieving the desired modifications in a single step, which often struggles with ambiguous user intent, complex transformations, or the need for progressive refinements. As a result, these methods frequently produce inconsistent outcomes or fail to meet user expectations. To address these challenges, we propose a multi-turn image editing framework that enables users to iteratively refine their edits, progressively achieving more satisfactory results. Our approach leverages flow matching for accurate image inversion and a dual-objective Linear Quadratic Regulators (LQR) for stable sampling, effectively mitigating error accumulation. Additionally, by analyzing the layer-wise roles of transformers, we introduce a adaptive attention highlighting method that enhances editability while preserving multi-turn coherence. Extensive experiments demonstrate that our framework significantly improves edit success rates and visual fidelity compared to existing methods.