Training Video Foundation Models with NVIDIA NeMo

πŸ“… 2025-03-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work presents a scalable, open-source training pipeline for Video Foundation Models (VFMs) built on NVIDIA NeMo. It targets inefficiencies in video data processing and scalability limits in large-scale training by combining accelerated video dataset curation, asynchronous multimodal data loading, dynamic-resolution video ingestion, and parallelized video diffusion model training and inference across multi-node GPU clusters. The authors also provide a comprehensive performance analysis, highlighting best practices for improving training throughput and GPU memory efficiency.

πŸ“ Abstract
Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.
Problem

Research questions and friction points this paper is trying to address.

Challenges in training large-scale, high-quality Video Foundation Models.
Need for scalable and efficient video dataset curation and model training.
Need to optimize performance of video diffusion model training and inference.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable open-source VFM training pipeline
Accelerated video dataset curation
Parallelized video diffusion model training
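The dynamic-resolution ingestion idea above can be illustrated with a small sketch: grouping clips into resolution buckets so that every batch has a uniform shape, avoiding padding waste. This is a hypothetical helper for illustration only, not the NVIDIA NeMo API.

```python
# Hypothetical sketch of resolution-bucketed batching for video ingestion.
# Grouping clips by (height, width, frames) keeps every batch uniform in
# shape, which avoids padding waste in video diffusion training.
# Illustrative only; this is not the NVIDIA NeMo API.
from collections import defaultdict


def bucket_clips(clips, batch_size):
    """Group clip metadata dicts into fixed-shape batches.

    clips: iterable of dicts with 'height', 'width', and 'frames' keys.
    Returns a list of batches; each batch holds clips of identical shape.
    """
    buckets = defaultdict(list)
    for clip in clips:
        key = (clip["height"], clip["width"], clip["frames"])
        buckets[key].append(clip)

    batches = []
    for group in buckets.values():
        # Emit only full batches; a real loader would pad or defer the
        # leftover clips, but we drop them here for simplicity.
        for i in range(0, len(group) - batch_size + 1, batch_size):
            batches.append(group[i:i + batch_size])
    return batches
```

A real pipeline would apply the same idea before sharding batches across data-parallel ranks, so each GPU sees fixed-shape tensors per step.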
Authors

Zeeshan Patel (xAI) - Deep Learning, Generative AI, Computer Vision
Ethan He (xAI) - LLM, deep learning, multimodal, computer vision
Parth Mannan (NVIDIA)
Xiaowei Ren (Senior Deep Learning Architect, NVIDIA) - Computer Architecture
Ryan Wolf (NVIDIA)
Niket Agarwal (NVIDIA)
Jacob Huffman (NVIDIA)
Zhuoyao Wang (NVIDIA)
Carl Wang (NVIDIA)
Jack Chang (NVIDIA)
Yan Bai (University of Rochester) - macroeconomics, international macroeconomics
Tommy Huang (NVIDIA)
Linnan Wang (Brown University) - Artificial Intelligence, High Performance Computing, Distributed Systems
Sahil Jain (NVIDIA)
Shanmugam Ramasamy (NVIDIA)
Joseph Jennings (NVIDIA)
Ekaterina Sirazitdinova (NVIDIA)
Oleg Sudakov (NVIDIA)
Mingyuan Ma (UC Berkeley / Harvard University / NVIDIA)
Bobby Chen (NVIDIA)
Forrest Lin (NVIDIA)
Hao Wang (NVIDIA)
Vasanth Rao Naik Sabavat (NVIDIA)
Sriharsha Niverty (NVIDIA)
Rong Ou (NVIDIA)
Pallab Bhattacharya (NVIDIA)
David Page (NVIDIA)
Nima Tajbakhsh (NVIDIA) - Computer Vision and Artificial Intelligence
Ashwath Aithal (NVIDIA)