VACE: All-in-One Video Creation and Editing

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fragmentation between video generation and editing tasks, as well as the lack of unified conditional modeling. We propose a diffusion-Transformer-based unified video synthesis framework. Methodologically, we design a video-level unified conditioning interface, the Video Condition Unit (VCU), which encodes heterogeneous inputs (e.g., reference images, source videos, and masks) into consistent representations; additionally, we introduce a Context Adapter that injects task concepts into the model via formalized temporal and spatial representations. The framework supports three distinct task families within a single model: reference-to-video generation, video-to-video editing, and masked video-to-video editing. Experiments demonstrate performance competitive with task-specific models across multiple subtasks and enable cross-task compositional applications, significantly enhancing the model's generality and practical utility.

📝 Abstract
Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations. Project page: https://ali-vilab.github.io/VACE-Page/.
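The abstract's central idea is that editing, reference, and masking inputs can all be expressed through one interface, the Video Condition Unit. As a rough illustration of what such a normalization step might look like, here is a hypothetical Python sketch; the class name, array shapes, and task strings are assumptions for illustration only, not the paper's actual implementation:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class VideoConditionUnit:
    """Illustrative stand-in for a VCU-style unified interface: bundles
    heterogeneous conditioning inputs into one structure. Shapes and
    field names are assumptions, not taken from the paper."""
    frames: np.ndarray        # context video, shape (T, H, W, 3); zeros if absent
    masks: np.ndarray         # per-pixel mask, shape (T, H, W); 1 = regenerate
    references: list = field(default_factory=list)  # reference images, each (H, W, 3)

def make_vcu(task, num_frames=8, height=64, width=64,
             frames=None, masks=None, references=None):
    """Normalize task-specific inputs into a single VCU.

    Illustrative mapping of the three task families named in the abstract:
    - reference-to-video: no context frames, the whole clip is generated
    - video-to-video: a full source video, every pixel may change
    - masked video-to-video: a source video plus a user-supplied mask
    """
    if frames is None:
        frames = np.zeros((num_frames, height, width, 3), dtype=np.float32)
    if task in ("reference-to-video", "video-to-video"):
        masks = np.ones(frames.shape[:3], dtype=np.float32)
    elif task == "masked-video-to-video":
        if masks is None:
            raise ValueError("masked editing requires a user-supplied mask")
    else:
        raise ValueError(f"unknown task: {task}")
    return VideoConditionUnit(frames=frames, masks=masks,
                              references=list(references or []))
```

The point of the sketch is only structural: once every task is reduced to the same (frames, masks, references) triple, a single model can consume all of them without task-specific input heads.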
Problem

Research questions and friction points this paper is trying to address.

Lack of a unified approach spanning video synthesis and editing tasks.
Maintaining consistency across temporal and spatial dynamics in video tasks.
Flexible handling of diverse video synthesis applications within a single model.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for video creation and editing
Video Condition Unit integrates task inputs
Context Adapter injects task concepts flexibly
Authors

Zeyinzi Jiang, Alibaba Group
Zhen Han, Tongyi Lab, Alibaba Group
Chaojie Mao, Alibaba Group (computer vision)
Jingfeng Zhang, Tongyi Lab, Alibaba Group
Yulin Pan, Alibaba Group (computer vision, multimedia search)
Yu Liu, Tongyi Lab, Alibaba Group