DIVE: Taming DINO for Subject-Driven Video Editing

📅 2024-12-04
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 0
🤖 AI Summary
To address poor temporal consistency, inaccurate motion alignment, and weak subject identity control in video editing, this paper proposes a subject-driven video editing method leveraging DINOv2 semantic features. The method jointly exploits DINOv2 features for both motion trajectory alignment and subject identity registration, a unification not previously explored. It further introduces a DINO-guided LoRA fine-tuning mechanism to enable text- or reference-image-driven editing with consistent subject identity across frames. By integrating optical-flow-aware feature alignment with diffusion-based temporal modeling, the approach achieves high-fidelity, temporally coherent edits on real-world videos. Quantitative and qualitative evaluations demonstrate that the method significantly outperforms existing text- and image-driven approaches, particularly in scenarios involving complex motions and multi-view perspectives.
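The "implicit correspondence" idea behind the motion alignment can be sketched in a few lines: patch features from consecutive frames are matched by cosine similarity, and each patch's best match traces the motion trajectory. The sketch below uses random vectors as stand-ins for DINOv2 patch features (in the actual method these would come from a pretrained DINOv2 backbone); it illustrates only the matching step, not the paper's full pipeline.

```python
import numpy as np

# Stand-in for DINOv2 patch features: random vectors keep the sketch
# self-contained; in practice these come from a pretrained DINOv2 model.
rng = np.random.default_rng(0)
num_patches, dim = 16, 32
feats_src = rng.normal(size=(num_patches, dim))  # patch features of frame t
feats_tgt = rng.normal(size=(num_patches, dim))  # patch features of frame t+1

def normalize(x):
    # L2-normalize feature vectors so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between every source patch and every target patch.
sim = normalize(feats_src) @ normalize(feats_tgt).T  # (num_patches, num_patches)

# Each source patch is matched to its most similar target patch; these
# argmax matches act as implicit correspondences across frames.
matches = sim.argmax(axis=1)
```

Because DINOv2 features are semantically discriminative, the same subject part tends to match itself across frames, which is what makes this simple nearest-neighbor matching usable as a motion-alignment signal.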

๐Ÿ“ Abstract
Building on the success of diffusion models in image generation and editing, video editing has recently gained substantial attention. However, maintaining temporal consistency and motion alignment still remains challenging. To address these issues, this paper proposes DINO-guided Video Editing (DIVE), a framework designed to facilitate subject-driven editing in source videos conditioned on either target text prompts or reference images with specific identities. The core of DIVE lies in leveraging the powerful semantic features extracted from a pretrained DINOv2 model as implicit correspondences to guide the editing process. Specifically, to ensure temporal motion consistency, DIVE employs DINO features to align with the motion trajectory of the source video. For precise subject editing, DIVE incorporates the DINO features of reference images into a pretrained text-to-image model to learn Low-Rank Adaptations (LoRAs), effectively registering the target subject's identity. Extensive experiments on diverse real-world videos demonstrate that our framework can achieve high-quality editing results with robust motion consistency, highlighting the potential of DINO to contribute to video editing. Project page: https://dino-video-editing.github.io
Problem

Research questions and friction points this paper is trying to address.

Ensuring temporal consistency in video editing
Aligning motion trajectories in source videos
Precise subject-driven editing with reference images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages DINOv2 features for semantic guidance
Aligns DINO features with source motion trajectory
Uses LoRAs for precise subject identity registration
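The LoRA registration mentioned above can be illustrated with a minimal sketch (not the paper's implementation): a frozen weight W is adapted by a low-rank update B @ A, so only the small matrices A and B are trained. In DIVE, such adapters are learned on a pretrained text-to-image model conditioned on DINO features of the reference subject; here plain numpy matrices stand in for the model's layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))               # trainable up-projection, zero-init

def lora_forward(x, scale=1.0):
    # Output = frozen path + scaled low-rank adapter path.
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.normal(size=(1, d_in))
y = lora_forward(x)
# With B zero-initialized, the adapter starts as an exact no-op,
# so training begins from the pretrained model's behavior.
assert np.allclose(y, x @ W.T)
```

The zero-initialized up-projection is the standard LoRA design choice: the adapted model is identical to the pretrained one at step zero, and identity information is injected only as A and B are optimized.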
Authors

Yi Huang (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Wei Xiong (Adobe Research)
He Zhang (Adobe Research)
Chaoqi Chen (Shenzhen University)
Jianzhuang Liu (Shenzhen Institutes of Advanced Technology, University of Chinese Academy of Sciences)
Mingfu Yan (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
Shifeng Chen (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)