🤖 AI Summary
Current video editing methods struggle with complex tasks, such as inserting novel objects or modeling non-rigid deformations, because they rely on computationally expensive fine-tuning, optimization, or additional training. To address this, we propose a zero-shot, text-driven video editing framework. Our core contribution is the empirical identification of "vital layers" in the video diffusion model: the layers that most strongly influence the quality of generated outputs, which turn out to be closely associated with Rotary Position Embeddings (RoPE). We selectively transfer key-value (KV) features from the source video into these vital layers of the target generation pass, and pair this injection with masks extracted from prominent layers to localize the region described by the new prompt, achieving precise spatiotemporal alignment. The method requires no fine-tuning, gradient-based optimization, or auxiliary training, and it enables high-fidelity object insertion and non-rigid motion editing. Extensive experiments demonstrate substantial improvements over state-of-the-art approaches in editing fidelity, temporal consistency, and mask accuracy.
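For intuition, here is a minimal PyTorch-style sketch of the kind of selective key-value injection the summary describes. This is not the authors' implementation: the indices in `VITAL_LAYERS`, the `source_kv` cache, and the function itself are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

# Hypothetical indices of "vital" transformer blocks; per the paper, such
# layers are identified empirically by how strongly they influence
# generation quality, and they correlate with RoPE.
VITAL_LAYERS = {2, 5, 11, 18}

def attention_with_kv_injection(q, k, v, layer_idx, source_kv=None):
    """Attention for the target (editing) pass that optionally swaps in
    key/value features cached from the source-video denoising pass.

    q, k, v:   (batch, heads, tokens, dim) tensors of the target pass.
    source_kv: dict mapping layer_idx -> (k_src, v_src) cached earlier.
    """
    if source_kv is not None and layer_idx in VITAL_LAYERS:
        # Inject source K/V only in the vital layers, so the edit stays
        # spatiotemporally aligned with the source video content.
        k, v = source_kv[layer_idx]
    return F.scaled_dot_product_attention(q, k, v)
```

Restricting injection to a small set of vital layers, rather than all layers, is what balances faithfulness to the source video against the freedom to synthesize new content.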
📝 Abstract
Video editing has garnered increasing attention alongside the rapid progress of diffusion-based video generation models. As part of these advancements, there is a growing demand for more accessible and controllable forms of video editing, such as prompt-based editing. Previous studies have primarily focused on tasks such as style transfer, background replacement, object substitution, and attribute modification, while maintaining the content structure of the source video. However, more complex tasks, including the addition of novel objects and non-rigid transformations, remain relatively unexplored. In this paper, we present TV-LiVE, a Training-free and text-guided Video editing framework via Layer-informed Vitality Exploitation. We empirically identify vital layers within the video generation model that significantly influence the quality of generated outputs. Notably, these layers are closely associated with Rotary Position Embeddings (RoPE). Based on this observation, our method enables both object addition and non-rigid video editing by selectively injecting key and value features from the source model into the corresponding layers of the target model, guided by layer vitality. For object addition, we further identify prominent layers from which to extract mask regions corresponding to the newly added target prompt. We find that the masks extracted from these prominent layers faithfully indicate the region to be edited. Experimental results demonstrate that TV-LiVE outperforms existing approaches for both object addition and non-rigid video editing. Project Page: https://emjay73.github.io/TV_LiVE/
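As a companion sketch, the mask-extraction step for object addition might look like the following. This is an illustrative assumption, not the paper's code: `attn_maps`, `new_token_ids`, `prominent_layers`, and the 0.5 threshold are all hypothetical, and the sketch assumes the mask comes from cross-attention between video tokens and the tokens newly added by the target prompt.

```python
import torch

def extract_edit_mask(attn_maps, new_token_ids, prominent_layers, thresh=0.5):
    """Derive a binary edit mask from cross-attention in prominent layers.

    attn_maps:     dict layer_idx -> (heads, video_tokens, text_tokens)
                   cross-attention maps collected during denoising.
    new_token_ids: indices of the tokens added by the target prompt.
    """
    per_layer = [
        attn_maps[l][..., new_token_ids].mean(dim=(0, -1))  # (video_tokens,)
        for l in prominent_layers
    ]
    score = torch.stack(per_layer).mean(dim=0)   # aggregate over layers
    # Normalize to [0, 1] before thresholding into a binary mask.
    score = (score - score.min()) / (score.max() - score.min() + 1e-8)
    return score > thresh                        # True = region to edit
```

The per-token boolean mask would then be reshaped to the latent video's frame grid to gate where the edit is applied.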