UniVideo: Unified Understanding, Generation, and Editing for Videos

📅 2025-10-09
🤖 AI Summary
Existing unified multimodal models focus predominantly on image understanding and lack integrated capabilities for video understanding, generation, and editing. Method: UniVideo is proposed as the first end-to-end framework unifying diverse video tasks, built on a dual-stream architecture that pairs a Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT). It jointly models instruction parsing, video synthesis, and visual-consistency preservation within a single instruction-following paradigm, enabled by vision-guided prompting and joint training. Contribution/Results: UniVideo achieves task-composition generalization and zero-shot transfer without task-specific fine-tuning, supporting flexible video editing, including the joint application of style transfer and editing, zero-shot chroma keying, and material replacement. On benchmarks spanning text/image-to-video generation, in-context generation, and video editing, it matches or surpasses dedicated task-specific models.

📝 Abstract
Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
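The dual-stream design can be made concrete with a minimal sketch. The module names, dimensions, and interfaces below are assumptions for illustration only; the paper does not publish the architecture at this level of detail. A stand-in MLLM encodes the multimodal instruction into hidden states, and a stand-in MMDiT denoises video latents while cross-attending to those states.

```python
# Minimal sketch of a dual-stream MLLM + MMDiT setup, assuming
# hypothetical modules and dimensions (not UniVideo's actual code).
import torch
import torch.nn as nn

class InstructionStream(nn.Module):
    """Stand-in for the MLLM: maps instruction tokens to hidden states."""
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, tokens):                    # (B, T_text)
        return self.encoder(self.embed(tokens))   # (B, T_text, dim)

class VideoStream(nn.Module):
    """Stand-in for the MMDiT: denoises video latents, cross-attending
    to the instruction hidden states to preserve instruction adherence."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj_in = nn.Linear(dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, latents, cond):              # latents: (B, T_video, dim)
        h = self.proj_in(latents)
        h, _ = self.cross_attn(h, cond, cond)      # condition on instruction
        return self.proj_out(h)                    # predicted noise / velocity

# One forward pass: instruction hidden states guide video denoising.
mllm, mmdit = InstructionStream(), VideoStream()
tokens = torch.randint(0, 32000, (1, 16))    # tokenized multimodal instruction
latents = torch.randn(1, 64, 512)            # noisy video latents
noise_pred = mmdit(latents, mllm(tokens))
print(noise_pred.shape)                      # torch.Size([1, 64, 512])
```

The split mirrors the paper's motivation: the language stream handles complex instruction understanding, while the diffusion stream stays focused on synthesis and visual consistency.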
Problem

Research questions and friction points this paper is trying to address.

Unified multimodal modeling remains largely limited to the image domain, leaving video generation and editing unaddressed
Complex multimodal video instructions are difficult to interpret accurately
Diverse video tasks lack a single multimodal instruction paradigm
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream design combining an MLLM for instruction understanding with an MMDiT for video generation
Unified multimodal instruction paradigm covering diverse video tasks (see the sketch after this list)
Transfers editing capability learned from large-scale image editing data to the video domain
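The sketch below illustrates how a unified instruction paradigm might represent heterogeneous tasks through one interface. The schema and field names are invented for this example and are not the paper's actual data format.

```python
# Hedged illustration of casting diverse video tasks as one instruction
# format; this dataclass is a hypothetical schema, not UniVideo's.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MultimodalInstruction:
    """One record format for every task (hypothetical schema)."""
    text: str                                        # natural-language instruction
    images: List[str] = field(default_factory=list)  # reference images, if any
    video: Optional[str] = None                      # source video, for editing tasks

# The same record type covers generation, editing, and their composition,
# which is what lets a single jointly trained model generalize across tasks.
tasks = [
    MultimodalInstruction("A corgi surfing at sunset"),                    # text-to-video
    MultimodalInstruction("Animate this photo", images=["portrait.png"]),  # image-to-video
    MultimodalInstruction("Make the jacket leather", video="clip.mp4"),    # free-form editing
    MultimodalInstruction("Apply this style and remove the car",           # task composition
                          images=["style.png"], video="street.mp4"),
]
for t in tasks:
    print(t.text, bool(t.images), bool(t.video))
```

Because every task flows through the same interface, combining capabilities (e.g., style transfer plus editing) is just another instruction rather than a new task head, which is the basis for the task-composition and zero-shot transfer results reported above.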