Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing

📅 2026-02-09

📈 Citations: 0

✨ Influential: 0

career value

191K/year

🤖 AI Summary

Existing video generation and editing methods still struggle to faithfully follow fine-grained, compositional user instructions. This work proposes a novel framework that, for the first time, deeply integrates the semantic understanding and reasoning capabilities of a pretrained multimodal large language model (MLLM) into a video diffusion model. By employing lightweight adapters, the approach efficiently injects multimodal conditional information in a parameter-efficient manner, enabling high-quality video generation and fine-grained editing. The method unifies support for high-resolution, multi-task scenarios and significantly outperforms existing approaches on the FiVE and VBench benchmarks, achieving state-of-the-art performance particularly in complex instruction following, generation quality, and editing flexibility.

Technology Category

Application Category

📝 Abstract

We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large-language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions to interpret user instructions. In this way, the rich contextual representations from the understanding model are directly used to guide the generative process, thereby improving performance on complex and compositional editing. Moreover, a lightweight adapter is developed to inject multimodal conditional tokens into pretrained text-to-video diffusion models, allowing maximum reuse of their powerful generative priors in a parameter-efficient manner. Benefiting from these designs, we scale up Omni-Video 2 to a 14B video diffusion model on meticulously curated training data with quality, supporting high quality text-to-video generation and various video editing tasks such as object removal, addition, background change, complex motion editing, \emph{etc.} We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation. The results demonstrate its superior ability to follow complex compositional instructions in video editing, while also achieving competitive or superior quality in video generation tasks.

Problem

Research questions and friction points this paper is trying to address.

video generation

video editing

multimodal large language models

diffusion models

compositional instructions

Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM-conditioned diffusion

unified video generation and editing

lightweight adapter