🤖 AI Summary
Current video editing methods struggle with complex tasks, such as inserting novel objects or modeling non-rigid deformations, because they rely on computationally expensive fine-tuning, optimization, or additional training. To address this, we propose a zero-shot, text-driven video editing framework. Our core contribution is the empirical identification of "vital layers" in the video diffusion model: the layers that most strongly influence the quality of generated outputs, which turn out to be closely associated with Rotary Position Embeddings (RoPE). We selectively transfer key-value (KV) features from the source video into these vital layers of the target generation pass, and pair this injection with masks extracted from prominent layers to localize the region described by the new prompt, achieving precise spatiotemporal alignment. The method requires no fine-tuning, gradient-based optimization, or auxiliary training, and it enables high-fidelity object insertion and non-rigid motion editing. Extensive experiments demonstrate substantial improvements over state-of-the-art approaches in editing fidelity, temporal consistency, and mask accuracy.
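For intuition, here is a minimal PyTorch-style sketch of the kind of selective key-value injection the summary describes. This is not the authors' implementation: the indices in `VITAL_LAYERS`, the `source_kv` cache, and the function itself are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

# Hypothetical indices of "vital" transformer blocks; per the paper, such
# layers are identified empirically by how strongly they influence
# generation quality, and they correlate with RoPE.
VITAL_LAYERS = {2, 5, 11, 18}

def attention_with_kv_injection(q, k, v, layer_idx, source_kv=None):
    """Attention for the target (editing) pass that optionally swaps in
    key/value features cached from the source-video denoising pass.

    q, k, v:   (batch, heads, tokens, dim) tensors of the target pass.
    source_kv: dict mapping layer_idx -> (k_src, v_src) cached earlier.
    """
    if source_kv is not None and layer_idx in VITAL_LAYERS:
        # Inject source K/V only in the vital layers, so the edit stays
        # spatiotemporally aligned with the source video content.
        k, v = source_kv[layer_idx]
    return F.scaled_dot_product_attention(q, k, v)
```

Restricting injection to a small set of vital layers, rather than all layers, is what balances faithfulness to the source video against the freedom to synthesize new content.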
📝 Abstract
Video editing has garnered increasing attention alongside the rapid progress of diffusion-based video generation models. As part of these advancements, there is a growing demand for more accessible and controllable forms of video editing, such as prompt-based editing. Previous studies have primarily focused on tasks such as style transfer, background replacement, object substitution, and attribute modification, while maintaining the content structure of the source video. However, more complex tasks, including the addition of novel objects and non-rigid transformations, remain relatively unexplored. In this paper, we present TV-LiVE, a Training-free and text-guided Video editing framework via Layer-informed Vitality Exploitation. We empirically identify vital layers within the video generation model that significantly influence the quality of generated outputs. Notably, these layers are closely associated with Rotary Position Embeddings (RoPE). Based on this observation, our method enables both object addition and non-rigid video editing by selectively injecting key and value features from the source model into the corresponding layers of the target model, guided by layer vitality. For object addition, we further identify prominent layers from which to extract mask regions corresponding to the newly added target prompt. We find that the masks extracted from these prominent layers faithfully indicate the region to be edited. Experimental results demonstrate that TV-LiVE outperforms existing approaches for both object addition and non-rigid video editing. Project Page: https://emjay73.github.io/TV_LiVE/
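As a companion sketch, the mask-extraction step for object addition might look like the following. This is an illustrative assumption, not the paper's code: `attn_maps`, `new_token_ids`, `prominent_layers`, and the 0.5 threshold are all hypothetical, and the sketch assumes the mask comes from cross-attention between video tokens and the tokens newly added by the target prompt.

```python
import torch

def extract_edit_mask(attn_maps, new_token_ids, prominent_layers, thresh=0.5):
    """Derive a binary edit mask from cross-attention in prominent layers.

    attn_maps:     dict layer_idx -> (heads, video_tokens, text_tokens)
                   cross-attention maps collected during denoising.
    new_token_ids: indices of the tokens added by the target prompt.
    """
    per_layer = [
        attn_maps[l][..., new_token_ids].mean(dim=(0, -1))  # (video_tokens,)
        for l in prominent_layers
    ]
    score = torch.stack(per_layer).mean(dim=0)   # aggregate over layers
    # Normalize to [0, 1] before thresholding into a binary mask.
    score = (score - score.min()) / (score.max() - score.min() + 1e-8)
    return score > thresh                        # True = region to edit
```

The per-token boolean mask would then be reshaped to the latent video's frame grid to gate where the edit is applied.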