Motif-Video 2B: Technical Report

📅 2026-04-14

🤖 AI Summary

This work asks whether high-quality text-to-video generation is achievable under tight limits on data (fewer than 10M video clips) and compute (less than 100,000 H200 GPU hours). The authors propose an architecture that separates three roles that can interfere when handled through a single pathway: prompt alignment, temporal consistency, and fine-detail recovery. The model combines Shared Cross-Attention, which strengthens text control over long video token sequences, with a three-part backbone that separates early fusion, joint representation learning, and detail refinement. Training relies on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. On VBench, the model scores 83.76%, surpassing Wan2.1 14B with 7× fewer parameters and substantially less training data, which suggests that architectural specialization paired with an efficient training recipe can narrow or exceed the quality gap usually associated with much larger models.

📝 Abstract

Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.
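
The abstract names dynamic token routing as part of the training recipe but does not say how tokens are selected. As a rough illustration of the general idea only, the sketch below sends a subset of tokens through an expensive block and lets the rest bypass it unchanged. The random selection rule, the `keep_ratio` parameter, and the names `route_tokens` and `toy_block` are all assumptions made for this example, not details taken from the paper.

```python
import random


def route_tokens(tokens, keep_ratio, block, rng):
    """Apply `block` to a random subset of tokens; the others bypass it.

    Hypothetical helper: random selection is an assumption for illustration,
    not the routing rule used by Motif-Video 2B.
    """
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    # Only the selected tokens pay the cost of the block.
    processed = block([tokens[i] for i in kept])
    out = list(tokens)
    for idx, value in zip(kept, processed):
        out[idx] = value
    return out, kept


def toy_block(xs):
    # Stand-in for an expensive transformer block.
    return [x * 2.0 for x in xs]


rng = random.Random(0)
tokens = [float(i) for i in range(8)]
routed, kept = route_tokens(tokens, keep_ratio=0.5, block=toy_block, rng=rng)
```

With `keep_ratio=0.5`, the block processes half of the tokens per call, which is the kind of saving such routing is generally meant to provide during training.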

Problem

Research questions and friction points this paper is trying to address.

text-to-video generation

temporal consistency

prompt alignment

model architecture

compute efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

architectural specialization

shared cross-attention

three-part backbone