🤖 AI Summary
Zero-shot consistent editing of real-world videos faces three core challenges: content consistency, object integrity, and temporal stability. To address these, we propose a purely attention-driven zero-shot shape-editing method built on the Stable Diffusion architecture. Our approach enhances temporal coherence via relaxed cross-frame self-attention, and enables localized, shape-level modifications through conditional cross-attention replacement and feature injection, requiring no fine-tuning, segmentation masks, training, or auxiliary signals (e.g., depth or optical flow). To our knowledge, this is the first parameter-free, shape-aware framework to achieve high-fidelity, long-range temporally consistent zero-shot editing on videos of up to 64 frames using attention mechanisms alone. Experiments demonstrate substantial improvements in structural stability and perceptual naturalness across diverse real-world videos.
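The relaxed cross-frame self-attention mentioned above can be illustrated with a minimal numpy sketch: each frame's queries attend not only to that frame's own keys/values but also to those of an anchor frame, so shared content stays consistent over time. The function name, the use of the first frame as anchor, and the simple concatenation scheme are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_self_attention(q, k, v, anchor_k, anchor_v):
    """Self-attention for one frame whose keys/values are extended with
    an anchor frame's keys/values, so every frame can attend to the same
    reference content (a common zero-shot trick for temporal coherence).
    q, k, v, anchor_k, anchor_v: arrays of shape (tokens, dim)."""
    k_ext = np.concatenate([anchor_k, k], axis=0)   # (2*tokens, dim)
    v_ext = np.concatenate([anchor_v, v], axis=0)   # (2*tokens, dim)
    scale = 1.0 / np.sqrt(q.shape[-1])
    attn = softmax(q @ k_ext.T * scale, axis=-1)    # (tokens, 2*tokens)
    return attn @ v_ext                             # (tokens, dim)
```

In practice such a hook would be installed inside the U-Net's self-attention layers; the sketch only shows the attention arithmetic itself.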
📝 Abstract
Even though large-scale text-to-image generative models show promising performance in synthesizing high-quality images, applying these models directly to image editing remains a significant challenge. The challenge is further amplified in video editing by the additional dimension of time, and especially in editing real-world videos, which requires maintaining a stable structural layout across frames while executing localized edits without disrupting existing content. In this paper, we propose RealCraft, an attention-control-based method for zero-shot real-world video editing. By swapping cross-attention for new feature injection and relaxing the spatial-temporal attention of the edited object, we achieve localized shape-wise edits along with enhanced temporal consistency. Our method operates directly on Stable Diffusion and requires no additional information. We showcase the proposed zero-shot attention-control-based method across a range of videos, demonstrating shape-wise, time-consistent and parameter-free editing in videos of up to 64 frames.
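The cross-attention swap with feature injection can likewise be sketched in a few lines: the attention map is computed against the source prompt's keys (preserving where each token attends), while the values injected come from the target prompt. The function name and the assumption that both prompts are padded to the same token length are ours for illustration; this is a generic sketch of the idea, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def swapped_cross_attention(q, k_src, v_tgt):
    """Cross-attention with feature injection: attention probabilities are
    computed against the *source* prompt's keys, then applied to the
    *target* prompt's values, so layout follows the source while content
    follows the edit. Assumes both prompts have the same token count.
    q: (image_tokens, dim); k_src, v_tgt: (text_tokens, dim)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    attn = softmax(q @ k_src.T * scale, axis=-1)  # (image_tokens, text_tokens)
    return attn @ v_tgt                           # (image_tokens, dim)
```

A conditional variant would apply this swap only at the edited object's locations, keeping the original attention elsewhere.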