Omni-Video: Democratizing Unified Video Understanding and Generation

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing foundation models predominantly focus on static images and lack a unified, efficient framework for joint video understanding, generation, and instruction-driven editing. To address this gap, the paper proposes Omni-Video: a lightweight pairing of a multimodal large language model (MLLM) with a diffusion decoder. A vision head on top of the MLLM produces continuous visual clues, and an adapter maps them into the conditional space of the diffusion decoder, which then generates high-quality videos from these clues. A multi-stage training strategy enables joint optimization of all three tasks under constrained computational resources. Experiments show satisfactory generalization across diverse benchmarks, including video question answering, text-to-video synthesis, and instruction-guided video editing, within a single efficient, unified architecture.

📝 Abstract
Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, production, and editing, yet current foundational models predominantly focus on processing images, leaving a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, and instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that serve as the input of diffusion decoders, which produce high-quality videos conditioned on these clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of the MLLM and an adapter before the input of the diffusion decoder; the vision head produces visual tokens, and the adapter maps these tokens into the conditional space of the diffusion decoder; and 2) an efficient multi-stage training scheme that quickly connects MLLMs and diffusion decoders with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing, and understanding tasks.
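The pipeline the abstract describes (MLLM hidden states → vision head → visual tokens → adapter → diffusion conditioning) can be pictured as two learned projections chained together. The sketch below is a minimal illustration under stated assumptions: the function names, dimensions, and single-matrix projections are placeholders, not the paper's actual modules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the report does not state the actual sizes.
llm_dim, vis_dim, cond_dim, num_tokens = 1024, 512, 768, 16

def vision_head(h, w1):
    """Project MLLM hidden states to continuous visual tokens (assumed design)."""
    return np.tanh(h @ w1)

def adapter(v, w2):
    """Map visual tokens into the diffusion decoder's conditional space."""
    return v @ w2

h = rng.standard_normal((num_tokens, llm_dim))        # MLLM hidden states
w1 = rng.standard_normal((llm_dim, vis_dim)) * 0.02   # vision-head weights
w2 = rng.standard_normal((vis_dim, cond_dim)) * 0.02  # adapter weights

cond = adapter(vision_head(h, w1), w2)
print(cond.shape)  # (16, 768)
```

The key design point the abstract emphasizes is that both attachments are lightweight: the MLLM and the diffusion decoder stay intact, and only these small bridging modules translate between their representation spaces.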
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap in unified video understanding and generation models
Enabling multimodal LLMs to generate videos via visual clues
Developing efficient training for video tasks with limited resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLMs generate continuous visual clues
Lightweight architecture connects MLLMs and diffusion decoders
Efficient multi-stage training with limited resources
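The multi-stage training idea above can be pictured as a schedule of which components are trainable at each stage. The stage names and the freeze/unfreeze split below are purely hypothetical illustrations; the report only states that training proceeds in stages to connect the MLLM and the diffusion decoder with limited data and compute.

```python
# Hypothetical staged schedule: train the cheap bridging modules first,
# then progressively unfreeze heavier components. All names are assumptions.
STAGES = [
    ("align",    {"vision_head": True,  "adapter": True,  "mllm": False, "diffusion": False}),
    ("generate", {"vision_head": True,  "adapter": True,  "mllm": False, "diffusion": True}),
    ("joint",    {"vision_head": True,  "adapter": True,  "mllm": True,  "diffusion": True}),
]

def trainable_components(stage_name):
    """Return the components unfrozen in the given (hypothetical) stage."""
    for name, flags in STAGES:
        if name == stage_name:
            return sorted(c for c, on in flags.items() if on)
    raise KeyError(stage_name)

print(trainable_components("align"))  # ['adapter', 'vision_head']
```

Training only the small vision head and adapter in early stages is what would make the connection between the pretrained MLLM and diffusion decoder cheap to establish.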
Zhiyu Tan
Fudan University
Hao Yang
Shanghai Academy of Artificial Intelligence for Science
Luozheng Qin
Shanghai Academy of Artificial Intelligence for Science
generative models, text-to-image generation, chokepoint ("neck-choking") technologies
Jia Gong
Shanghai Academy of Artificial Intelligence for Science
Mengping Yang
East China University of Science and Technology
few-shot learning, generative models
Hao Li
Fudan University