🤖 AI Summary
This work addresses the severe computational imbalance between visual understanding and generation—particularly in the video domain, where generation is significantly more expensive than comprehension—by proposing a unified architecture centered on a diffusion-based video generator. The framework enables knowledge transfer from generation to understanding through joint alignment of continuous video streams and discrete text streams, a modality-driven Mixture-of-Experts (MoE)-enhanced Transformer, and a bidirectional training mechanism comprising two stages, Knowledge Recall and Capability Refinement. Evaluated across both video generation and understanding tasks, the model achieves competitive performance, demonstrating the feasibility of a generation-centric paradigm as a pathway toward unified multimodal intelligence.
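The modality-driven MoE idea above can be sketched as a feed-forward layer that routes by modality rather than by a learned gate: video tokens pass through the original (pretrained) FFN so generative priors are preserved, while text tokens pass through a lightweight new FFN. This is a minimal illustrative sketch, not the paper's implementation; the class name, sizes, and routing details are assumptions.

```python
import torch
import torch.nn as nn

class ModalityMoEFFN(nn.Module):
    """Hypothetical modality-driven MoE feed-forward block.

    Routing is deterministic by modality: video tokens use the original
    FFN (which could stay frozen to preserve generative priors), text
    tokens use a smaller, newly added FFN. All names/sizes are
    illustrative assumptions, not the paper's actual architecture.
    """
    def __init__(self, dim: int, hidden: int, text_hidden: int):
        super().__init__()
        # Original generator FFN, reused for video tokens.
        self.video_ffn = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        # Lightweight expert added for text tokens.
        self.text_ffn = nn.Sequential(
            nn.Linear(dim, text_hidden), nn.GELU(), nn.Linear(text_hidden, dim)
        )

    def forward(self, x: torch.Tensor, is_text: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D) token features; is_text: (B, L) boolean modality mask.
        video_out = self.video_ffn(x)
        text_out = self.text_ffn(x)
        # Select per token: text expert where is_text, video expert elsewhere.
        return torch.where(is_text.unsqueeze(-1), text_out, video_out)
```

Because the router is a fixed modality mask rather than a trained gate, no load-balancing loss is needed and the video path is untouched by text training.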
📝 Abstract
Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
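The unified flow objective described above—continuous flow matching for video alongside discrete flow matching for text in a single process—can be sketched as a joint training loss. This is a hedged sketch under common assumptions (a rectified-flow-style linear interpolant with velocity prediction for the continuous branch, and mask-based token corruption with cross-entropy recovery for the discrete branch); the function names, the shared timestep, and the model interface are illustrative, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def unified_flow_loss(video_latents, text_tokens, model, vocab_size, mask_id):
    """Sketch of a joint continuous/discrete flow-matching loss.

    Continuous branch (video): interpolate between Gaussian noise and
    data, train the model to predict the velocity (data - noise).
    Discrete branch (text): corrupt tokens to a mask id with probability
    (1 - t), train the model to recover the originals via cross-entropy.
    All details here are assumptions for illustration.
    """
    B = video_latents.shape[0]
    t = torch.rand(B, device=video_latents.device)  # shared timestep per sample

    # --- continuous flow matching for video latents ---
    noise = torch.randn_like(video_latents)
    t_v = t.view(B, *([1] * (video_latents.dim() - 1)))
    x_t = (1 - t_v) * noise + t_v * video_latents   # linear interpolant
    target_velocity = video_latents - noise

    # --- discrete flow matching for text (mask-based corruption) ---
    keep = torch.rand(text_tokens.shape, device=text_tokens.device) < t.unsqueeze(1)
    corrupted = torch.where(keep, text_tokens,
                            torch.full_like(text_tokens, mask_id))

    pred_velocity, text_logits = model(x_t, corrupted, t)

    video_loss = F.mse_loss(pred_velocity, target_velocity)
    text_loss = F.cross_entropy(text_logits.reshape(-1, vocab_size),
                                text_tokens.reshape(-1))
    return video_loss + text_loss
```

Sharing a single timestep `t` across both branches is one way to keep video and text denoising in lockstep within one generative process; whether Uni-ViGU couples the timesteps this way is an assumption of this sketch.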