Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe computational imbalance between visual understanding and generation, which is especially pronounced in the video domain, where generation costs far more than comprehension. It proposes a unified architecture centered on a diffusion-based video generator. The framework transfers knowledge from generation to understanding through a unified flow that jointly models continuous video streams and discrete text streams, a modality-driven Mixture-of-Experts (MoE)-enhanced Transformer, and a bidirectional training mechanism comprising Knowledge Recall and Capability Refinement stages. Evaluated on both video generation and understanding tasks, the model achieves competitive performance, supporting a generation-centric paradigm as a pathway toward unified multimodal intelligence.
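One way to picture the modality-driven MoE block: self-attention stays shared across both streams so video and text tokens can attend to each other, while the feed-forward path forks by token modality. Below is a minimal PyTorch sketch, assuming video tokens keep the pretrained generator's FFN and text tokens route to a lightweight added expert; the class name `ModalityMoEBlock`, the expert sizing, and the fixed modality-mask routing are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ModalityMoEBlock(nn.Module):
    """Hypothetical sketch of a modality-driven MoE Transformer block.

    Attention is shared so video and text tokens can attend to each
    other; the FFN is routed by modality: video tokens keep the
    pretrained generator's FFN (preserving its generative prior),
    while text tokens use a smaller expert added for text generation.
    """

    def __init__(self, dim: int, num_heads: int = 8, text_ratio: float = 0.25):
        super().__init__()  # dim must be divisible by num_heads
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Original video expert (stands in for the generator's FFN).
        self.video_ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Lightweight text expert with a narrower hidden width.
        hidden = int(4 * dim * text_ratio)
        self.text_ffn = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor, is_text: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) mixed video+text tokens; is_text: (B, N) bool mask.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        # Dense routing for clarity: both experts run on all tokens and
        # the mask selects per token (a real system would gather/scatter).
        return x + torch.where(is_text.unsqueeze(-1),
                               self.text_ffn(h), self.video_ffn(h))
```

Routing by a fixed modality mask rather than a learned gate leaves the video expert's weights and inputs exactly as in the original generator, which is one plausible reading of how the added text layers "preserve generative priors".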
📝 Abstract
Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
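To make the "unified flow method" concrete, here is a sketch of a single joint objective, assuming rectified-flow-style continuous flow matching on video latents and mask-and-predict discrete flow matching on text tokens, tied by a shared timestep. The `model` interface returning `(velocity_pred, text_logits)` and the shared-timestep choice are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def unified_flow_loss(model, video_latents, text_ids, mask_id, t=None):
    """Sketch of one joint training step: continuous flow matching on
    video latents and discrete flow matching (mask-and-predict) on text
    tokens, sharing one timestep t.

    Assumed (hypothetical) interface: model(x_t, noisy_text, t) returns
    (velocity_pred, text_logits) with text_logits of shape (B, L, V).
    """
    B = video_latents.shape[0]
    if t is None:
        t = torch.rand(B, device=video_latents.device)  # shared timestep

    # Continuous flow matching (video): rectified-flow interpolation.
    noise = torch.randn_like(video_latents)
    t_v = t.view(B, *([1] * (video_latents.dim() - 1)))
    x_t = (1.0 - t_v) * video_latents + t_v * noise  # noisy latents
    v_target = noise - video_latents                 # target velocity

    # Discrete flow matching (text): mask each token w.p. t, predict it.
    mask = torch.rand_like(text_ids, dtype=torch.float) < t.unsqueeze(1)
    noisy_text = torch.where(mask, torch.full_like(text_ids, mask_id), text_ids)

    v_pred, text_logits = model(x_t, noisy_text, t)

    video_loss = F.mse_loss(v_pred, v_target)
    text_loss = (F.cross_entropy(text_logits[mask], text_ids[mask])
                 if mask.any() else text_logits.sum() * 0.0)
    return video_loss + text_loss
```

Sampling one `t` for both streams is just the simplest way to keep the two corruption processes in a single process; the paper may well schedule them differently.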
Problem

Research questions and friction points this paper is trying to address.

unified multimodal models
video generation
visual understanding
computational cost imbalance
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified multimodal learning
diffusion-based video generation
flow matching
modality-driven MoE
bidirectional training (sketched below this list)
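A minimal sketch of the two-stage bidirectional schedule (Knowledge Recall, then Capability Refinement) might look as follows. The loader contents, step counts, and the reuse of a joint loss such as the `unified_flow_loss` sketch above are assumptions, not the paper's training recipe.

```python
def train_bidirectional(model, optimizer, recall_loader, refine_loader,
                        loss_fn, mask_id, recall_steps=10_000,
                        refine_steps=10_000):
    """Hypothetical two-stage schedule; loaders yield
    (video_latents, token_ids) pairs and loss_fn is a joint objective
    such as unified_flow_loss above."""
    # Stage 1 (Knowledge Recall): reconstruct the original generation
    # prompts, so the model reuses text-video correspondences it
    # already learned as a generator.
    for _, (video, prompt_ids) in zip(range(recall_steps), recall_loader):
        loss = loss_fn(model, video, prompt_ids, mask_id)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Stage 2 (Capability Refinement): fine-tune on detailed captions
    # so the shared representation becomes discriminative enough for
    # understanding tasks.
    for _, (video, caption_ids) in zip(range(refine_steps), refine_loader):
        loss = loss_fn(model, video, caption_ids, mask_id)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```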
👥 Authors
Luozheng Qin
Shanghai Academy of AI for Science
generative model, text-to-image generation, critical bottleneck technologies
Jia Gong
Shanghai Academy of AI for Science
Qian Qiao
Independent Researcher
Tianjiao Li
Singapore University of Technology and Design
Li Xu
Singapore University of Technology and Design
Haoyu Pan
Shanghai Academy of AI for Science
Chao Qu
Fudan University
Zhiyu Tan
Fudan University
Hao Li
Fudan University, DAMO@Alibaba
Computer Vision, Deep Learning, AI4S