π€ AI Summary
To address inefficiencies in data processing, scalability limitations in training, and challenges in multimodal coordination for high-fidelity long-video generation, this paper introduces NeMo-VFMβthe first end-to-end acceleration framework tailored for Video Foundation Models (VFMs). Methodologically, it integrates intelligent video filtering, asynchronous multimodal data loading, dynamic-resolution video ingestion, and cross-modal alignment preprocessing, while unifying distributed diffusion model training and inference. Built upon NVIDIA NeMo and multi-node GPU parallelism, the framework significantly improves training throughput and GPU memory efficiency. Empirically, it achieves state-of-the-art video generation quality across multiple benchmarks, enabling kilo-frame high-fidelity modeling and real-time inference. This work delivers a scalable, high-performance, system-level solution for open-source VFM training.
π Abstract
Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale, high quality VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.