OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing open-source video generation models struggle to support flexible composition and reasoning across diverse tasks within a unified framework. To this end, the authors propose the first unified video generation architecture capable of handling free-form multimodal inputs and reasoning-driven generation. Through large-scale multimodal pretraining, the model integrates text, multiple images, and video inputs, augmented with temporal alignment and an agent-like reasoning mechanism to interpret complex user intents and generate high-quality videos. The work also introduces IntelligentVBench, the first benchmark tailored to evaluating intelligent video generation. The method achieves state-of-the-art performance among open-source unified video generation models, with code and models to be released publicly.

📝 Abstract
While proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, open-source alternatives lag significantly behind. Most academic models remain heavily fragmented, and the few existing efforts toward unified video generation still struggle to seamlessly integrate diverse tasks within a single framework. To bridge this gap, we propose OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open-source unified models. The code and models will be made publicly available soon. Project Page: https://omniweaving.github.io.
Problem

Research questions and friction points this paper is trying to address.

unified video generation
open-source models
multimodal composition
reasoning-informed video creation
fragmented academic models
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified video generation
multimodal composition
reasoning-informed generation
intelligent video benchmark
open-source video model