LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization

📅 2025-12-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Text-guided video editing (e.g., object removal/addition) suffers from challenges in preserving spatiotemporal consistency and often relies on manually annotated masks or reference images. This paper introduces the first end-to-end, text-driven video editing framework that requires neither external masks nor reference images. Our approach features three key innovations: (1) a learnable object-aware localization mechanism that implicitly models the semantic spatial and temporal positions of the target object; (2) a tightly coupled mask propagation module guided by optical flow and a diffusion-based mask predictor, enabling accurate and temporally coherent self-generation of editing regions; and (3) an integrated architecture unifying image-to-video generation and video inpainting for joint optimization of visual fidelity and temporal coherence. Extensive quantitative and human evaluations across multiple benchmarks demonstrate significant improvements over state-of-the-art baselines, producing high-fidelity, temporally smooth edited videos.

Technology Category

Application Category

📝 Abstract
Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using object-aware localization mechanism. Our approach utilizes a unique dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA.
Problem

Research questions and friction points this paper is trying to address.

Achieves mask-free video object removal and addition using text guidance
Ensures spatio-temporal consistency without auxiliary masks or reference images
Provides end-to-end video editing via learnable object-aware localization mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask-free video editing with learnable object-aware localization
Dataset pipeline integrates image-to-video translation and optical flow
End-to-end editing using Diffusion Mask Predictor without external controls
🔎 Similar Papers
Z
Zhihan Xiao
Tsinghua University
L
Lin Liu
Huawei Inc.
Yixin Gao
Yixin Gao
University of Science and Technology of China
Computer VisionLearned image compression
X
Xiaopeng Zhang
Huawei Inc.
Haoxuan Che
Haoxuan Che
Hong Kong University of Science and Technology
Interactive Video GenerationModel Generalization
S
Songping Mai
Tsinghua University
Q
Qi Tian
Huawei Inc.