ConsistEdit: Highly Consistent and Precise Training-free Visual Editing

📅 2025-10-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing training-free attention control methods struggle to preserve source content consistency under strong editing intensities—especially in multi-step or video editing, where errors accumulate progressively—while global consistency constraints impede fine-grained, attribute-specific editing (e.g., texture). This paper introduces the first fully automated, hierarchical attention modulation framework tailored for MM-DiT architectures. It operates across all diffusion sampling steps and attention layers, enabling vision-specific query-key-value token differentiation, mask-guided pre-attention fusion, and cross-modal attention decoupling. The method supports controllable, fine-grained editing in both structurally consistent and inconsistent scenarios, with progressive adjustment of structural fidelity. Experiments demonstrate state-of-the-art performance on multi-round and multi-region continuous editing tasks for both images and videos, significantly improving editing accuracy and temporal stability.

📝 Abstract
Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing generation models. However, current approaches struggle to simultaneously deliver strong editing strength while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without handcrafted designs, significantly enhancing reliability and consistency, which enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control.
Problem

Research questions and friction points this paper is trying to address.

Achieving strong editing strength while preserving source consistency
Preventing visual error accumulation in multi-round and video editing
Enabling fine-grained attribute modification without global consistency limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-only attention control for MM-DiT architecture
Mask-guided pre-attention fusion mechanism
Differentiated manipulation of query, key, and value tokens
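The three mechanisms above can be illustrated as a single fused attention step. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `consistedit_attention`, the two-branch (source/edit) setup, and the convention that the mask marks the region to edit are all hypothetical simplifications of the idea described in the summary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def consistedit_attention(q_src, k_src, v_src, q_edit, k_edit, v_edit, mask, d):
    """Sketch of vision-only, mask-guided pre-attention fusion.

    All inputs are (n_tokens, d) arrays of *vision* tokens only (text tokens
    are left untouched, per the vision-only control idea). `mask` is a boolean
    (n_tokens,) array: True = region to edit, False = region to preserve.
    """
    # Mask-guided pre-attention fusion: outside the edit region, reuse the
    # source branch's Q/K so source structure is preserved; inside it, keep
    # the edit branch's Q/K so the new prompt can take effect.
    q = np.where(mask[:, None], q_edit, q_src)
    k = np.where(mask[:, None], k_edit, k_src)
    # Differentiated Q/K/V handling (illustrative choice): V follows the edit
    # branch everywhere, so appearance attributes track the new prompt while
    # the fused Q/K anchor structure in unedited regions.
    v = v_edit
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v
```

With an all-True mask this reduces to plain edit-branch attention (maximum editing strength); with an all-False mask the Q/K structure is fully anchored to the source, which mirrors the paper's notion of progressively adjustable structural consistency.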
Zixin Yin
Hong Kong University of Science and Technology
Ling-Hao Chen
Ph.D. Student, Tsinghua University, IDEA Research
Computer Graphics, Computer Vision, Character Animation
Lionel Ni
Hong Kong University of Science and Technology (Guangzhou) and Hong Kong University of Science and Technology
Xili Dai
UC Berkeley; HKUST
Computer Vision