🤖 AI Summary
Zero-shot consistent editing of real-world videos faces three core challenges: content consistency, object integrity, and temporal stability. To address these, we propose a purely attention-driven zero-shot shape-editing method built on the Stable Diffusion architecture. Our approach enhances temporal coherence via relaxed cross-frame self-attention, and enables localized, shape-level modifications through conditional cross-attention replacement and feature injection, requiring no fine-tuning, segmentation masks, training, or auxiliary signals (e.g., depth or optical flow). To our knowledge, this is the first parameter-free, shape-aware framework to achieve high-fidelity, long-range temporally consistent zero-shot editing on videos of up to 64 frames using attention mechanisms alone. Experiments demonstrate substantial improvements in structural stability and perceptual naturalness across diverse real-world videos.
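The relaxed cross-frame self-attention mentioned above can be illustrated with a minimal numpy sketch: each frame's queries attend not only to that frame's own keys/values but also to those of an anchor frame, so shared content stays consistent over time. The function name, the use of the first frame as anchor, and the simple concatenation scheme are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_self_attention(q, k, v, anchor_k, anchor_v):
    """Self-attention for one frame whose keys/values are extended with
    an anchor frame's keys/values, so every frame can attend to the same
    reference content (a common zero-shot trick for temporal coherence).
    q, k, v, anchor_k, anchor_v: arrays of shape (tokens, dim)."""
    k_ext = np.concatenate([anchor_k, k], axis=0)   # (2*tokens, dim)
    v_ext = np.concatenate([anchor_v, v], axis=0)   # (2*tokens, dim)
    scale = 1.0 / np.sqrt(q.shape[-1])
    attn = softmax(q @ k_ext.T * scale, axis=-1)    # (tokens, 2*tokens)
    return attn @ v_ext                             # (tokens, dim)
```

In practice such a hook would be installed inside the U-Net's self-attention layers; the sketch only shows the attention arithmetic itself.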
📝 Abstract
Even though large-scale text-to-image generative models show promising performance in synthesizing high-quality images, applying these models directly to image editing remains a significant challenge. The challenge is further amplified in video editing by the additional dimension of time, and especially in editing real-world videos, which requires maintaining a stable structural layout across frames while executing localized edits without disrupting existing content. In this paper, we propose RealCraft, an attention-control-based method for zero-shot real-world video editing. By swapping cross-attention for new feature injection and relaxing the spatial-temporal attention of the edited object, we achieve localized shape-wise edits along with enhanced temporal consistency. Our method operates directly on Stable Diffusion and requires no additional information. We showcase the proposed zero-shot attention-control-based method across a range of videos, demonstrating shape-wise, time-consistent and parameter-free editing in videos of up to 64 frames.
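The cross-attention swap with feature injection can likewise be sketched in a few lines: the attention map is computed against the source prompt's keys (preserving where each token attends), while the values injected come from the target prompt. The function name and the assumption that both prompts are padded to the same token length are ours for illustration; this is a generic sketch of the idea, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def swapped_cross_attention(q, k_src, v_tgt):
    """Cross-attention with feature injection: attention probabilities are
    computed against the *source* prompt's keys, then applied to the
    *target* prompt's values, so layout follows the source while content
    follows the edit. Assumes both prompts have the same token count.
    q: (image_tokens, dim); k_src, v_tgt: (text_tokens, dim)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    attn = softmax(q @ k_src.T * scale, axis=-1)  # (image_tokens, text_tokens)
    return attn @ v_tgt                           # (image_tokens, dim)
```

A conditional variant would apply this swap only at the edited object's locations, keeping the original attention elsewhere.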