🤖 AI Summary
Existing video generation and editing methods still struggle to faithfully follow fine-grained, compositional user instructions. This work proposes a novel framework that, for the first time, deeply integrates the semantic understanding and reasoning capabilities of a pretrained multimodal large language model (MLLM) into a video diffusion model. By employing lightweight adapters, the approach efficiently injects multimodal conditional information in a parameter-efficient manner, enabling high-quality video generation and fine-grained editing. The method unifies support for high-resolution, multi-task scenarios and significantly outperforms existing approaches on the FiVE and VBench benchmarks, achieving state-of-the-art performance particularly in complex instruction following, generation quality, and editing flexibility.
📝 Abstract
We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large-language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions to interpret user instructions. In this way, the rich contextual representations from the understanding model are directly used to guide the generative process, thereby improving performance on complex and compositional editing. Moreover, a lightweight adapter is developed to inject multimodal conditional tokens into pretrained text-to-video diffusion models, allowing maximum reuse of their powerful generative priors in a parameter-efficient manner. Benefiting from these designs, we scale up Omni-Video 2 to a 14B video diffusion model on meticulously curated training data with quality, supporting high quality text-to-video generation and various video editing tasks such as object removal, addition, background change, complex motion editing, \emph{etc.} We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation. The results demonstrate its superior ability to follow complex compositional instructions in video editing, while also achieving competitive or superior quality in video generation tasks.