🤖 AI Summary
This work addresses the challenge of maintaining visual consistency across multiple rounds of video editing in existing video-to-video diffusion models. To this end, it introduces the first cross-editing consistency framework tailored for iterative video editing, leveraging an explicit external memory mechanism that combines accurate retrieval with a dynamic tokenization strategy to supply historical consistency cues to the current editing step. A learnable token compressor embedded within the DiT backbone further reduces redundant conditioning tokens, improving computational efficiency without compromising consistency. Experiments demonstrate that the proposed method substantially outperforms state-of-the-art approaches on tasks such as novel view synthesis and text-guided long-form video editing, achieving markedly improved cross-editing consistency with minimal additional overhead and an overall speedup of approximately 30%.
📝 Abstract
Recent foundational video-to-video diffusion models have achieved impressive results in editing user-provided videos by modifying appearance, motion, or camera movement. However, real-world video editing is often an iterative process, where users refine results across multiple rounds of interaction. In this multi-turn setting, current video editors struggle to maintain cross-consistency across sequential edits. In this work, we tackle, for the first time, the problem of cross-consistency in multi-turn video editing and introduce Memory-V2V, a simple yet effective framework that augments existing video-to-video models with explicit memory. Given an external cache of previously edited videos, Memory-V2V employs accurate retrieval and dynamic tokenization strategies to condition the current editing step on prior results. To further mitigate redundancy and computational overhead, we propose a learnable token compressor within the DiT backbone that compresses redundant conditioning tokens while preserving essential visual cues, achieving an overall speedup of 30%. We validate Memory-V2V on challenging tasks including video novel view synthesis and text-conditioned long video editing. Extensive experiments show that Memory-V2V produces videos that are significantly more cross-consistent with minimal computational overhead, while maintaining or even improving task-specific performance over state-of-the-art baselines. Project page: https://dohunlee1.github.io/MemoryV2V
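The abstract's pipeline, retrieve relevant past edits from an external cache, tokenize them, and compress the resulting conditioning tokens before feeding them to the backbone, can be illustrated with a minimal sketch. All names and mechanisms here are hypothetical stand-ins (mean-pooled embeddings for retrieval, average pooling in place of the learnable compressor); the paper's actual retrieval, tokenization, and compressor designs are not specified in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)

class EditMemory:
    """Toy external cache of previously edited clips (hypothetical design)."""
    def __init__(self):
        self.entries = []  # list of (clip_embedding, clip_tokens)

    @staticmethod
    def embed(video_tokens):
        # Stand-in clip embedding: mean-pool the clip's tokens.
        return video_tokens.mean(axis=0)

    def add(self, video_tokens):
        self.entries.append((self.embed(video_tokens), video_tokens))

    def retrieve(self, query_tokens, k=2):
        # Cosine-similarity retrieval of the k most relevant past edits.
        q = self.embed(query_tokens)
        sims = [float(q @ e) / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-8)
                for e, _ in self.entries]
        order = np.argsort(sims)[::-1][:k]
        return [self.entries[i][1] for i in order]

def compress_tokens(tokens, ratio=4):
    # Stand-in for the learnable token compressor: average-pool groups of
    # `ratio` tokens so the conditioning sequence shrinks by that factor.
    n = (tokens.shape[0] // ratio) * ratio
    return tokens[:n].reshape(-1, ratio, tokens.shape[1]).mean(axis=1)

# Usage: condition the current editing step on compressed memory tokens.
mem = EditMemory()
for _ in range(3):                       # three earlier editing rounds
    mem.add(rng.normal(size=(32, 64)))   # 32 tokens x 64 dims per clip

current = rng.normal(size=(32, 64))      # tokens of the clip being edited
retrieved = mem.retrieve(current, k=2)
cond = np.concatenate([compress_tokens(t) for t in retrieved], axis=0)
print(cond.shape)  # (16, 64): 2 retrieved clips x 8 compressed tokens each
```

In the actual system the compressed tokens would be appended to the DiT's conditioning sequence, which is what yields the reported efficiency gain: fewer conditioning tokens means less attention cost per editing step.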