🤖 AI Summary
Training large-scale video generation models faces challenges including difficult cross-modal alignment, complex long-horizon temporal modeling, and prohibitive computational costs. To address these, this paper introduces MUG-V, the first efficient training framework for video generation built on Megatron-Core. Methodologically, it (1) proposes a video compression and spatiotemporal disentanglement architecture to reduce sequence length and computational overhead; (2) incorporates cross-modal alignment optimization and curriculum-style pretraining to enhance text–video semantic consistency; and (3) delivers an end-to-end open-source training stack supporting near-linear multi-node scalability. Experiments demonstrate that MUG-V 10B achieves state-of-the-art performance across multiple e-commerce video generation benchmarks, significantly outperforming leading open-source models. Crucially, the release includes model weights, training scripts, and inference code, ensuring full reproducibility and facilitating industrial deployment of large video foundation models.
📝 Abstract
In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations deliver significant efficiency gains and performance improvements across all stages, including data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling; details are available on our webpage: https://github.com/Shopee-MUG/MUG-V