Movie Gen: A Cast of Media Foundation Models

📅 2024-10-17
🏛️ arXiv.org
📈 Citations: 65
Influential: 9
🤖 AI Summary
This work addresses high-quality video generation and editing with Movie Gen, a cast of media foundation models rather than a single model. Methodologically, it covers latent-space modeling of 1080p video at different aspect ratios, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations. Key contributions include a 30-billion-parameter video generation transformer trained with a maximum context length of 73K video tokens (a 16-second video at 16 fps), precise instruction-based editing, personalization from a user-provided image, and synchronized audio generation. The models set a new state of the art on five tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation.

📝 Abstract
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.
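The figures quoted in the abstract can be checked with a little arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and "73K" is read as roughly 73,000 tokens, which is an assumption. The abstract does not say how video tokens map to frames, so the last number is an average ratio, not a description of the model's tokenizer.

```python
# Back-of-the-envelope check of the numbers quoted in the abstract.
# All names here are hypothetical and exist only for this illustration.

DURATION_SECONDS = 16        # generated video length stated in the abstract
FRAMES_PER_SECOND = 16       # frame rate stated in the abstract
MAX_CONTEXT_TOKENS = 73_000  # assumption: "73K" taken as about 73,000 tokens


def frame_count(duration_s: int, fps: int) -> int:
    """Number of output frames in a clip of the given length and frame rate."""
    return duration_s * fps


def average_tokens_per_frame(context_tokens: int, frames: int) -> float:
    """Average context tokens per output frame (a ratio, not a tokenizer spec)."""
    return context_tokens / frames


frames = frame_count(DURATION_SECONDS, FRAMES_PER_SECOND)
avg_tokens_per_frame = average_tokens_per_frame(MAX_CONTEXT_TOKENS, frames)

print(f"frames: {frames}")
print(f"average tokens per frame: {avg_tokens_per_frame:.2f}")
```

Under these assumptions, a 16-second clip at 16 fps has 256 frames, so the stated context length averages a few hundred tokens per output frame.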
Problem

Research questions and friction points this paper is trying to address.

Develop high-quality 1080p HD video generation.
Enable precise instruction-based video editing.
Generate personalized videos using user images.
Innovation

Methods, ideas, or system contributions that make the work stand out.

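Generates 1080p HD videos with different aspect ratios and synchronized audio.
Scales to a 30B-parameter transformer with a maximum context length of 73K video tokens.
Supports precise instruction-based video editing.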
Authors

Adam Polyak | Facebook AI Research | ML, DL, Speech Processing
Amit Zohar | Facebook AI Research
Andrew Brown
Andros Tjandra | FAIR (Meta AI) | speech recognition, speech processing, machine learning, deep learning, natural language processing
Animesh Sinha
Ann Lee | Meta AI
Apoorv Vyas | FAIR Labs Meta | Deep Learning, Speech Recognition, Computer Vision
Bowen Shi
Chih-Yao Ma | Member of Technical Staff @ Microsoft AI | Generative Model, Computer Vision, Natural Language Processing, Machine Learning, Deep Learning
Ching-Yao Chuang | xAI | Generative AI, Machine Learning
David Yan
Dhruv Choudhary
Dingkang Wang
Geet Sethi
Guan Pang | Meta GenAI | Generative AI, OCR
Haoyu Ma
Ishan Misra | GenAI, Meta | Computer Vision, Machine Learning
Ji Hou | Research Scientist, Meta Superintelligence Labs | Generative AI, 3D Computer Vision
Jialiang Wang | Research Scientist, Meta AI | Computer Vision, Generative AI
Kiran Jagadeesh
Kunpeng Li | Research Scientist, Meta Superintelligence Labs | Computer Vision, Deep Learning
Luxin Zhang
Mannat Singh
Mary Williamson
Matt Le
Matthew Yu
Mitesh Kumar Singh
Peizhao Zhang | Research Scientist, Meta MSL | Computer Vision, Computer Graphics
Peter Vajda
Quentin Duval
Rohit Girdhar | Research Scientist, GenAI, Meta | Computer Vision, Machine Learning
Roshan Sumbaly
Sai Saketh Rambhatla | Research Scientist
Sam S. Tsai | Stealth Startup, ex-Meta, ex-Amazon, Stanford | Generative AI, MLLM, Visual Search, Computer Vision, Multimedia
S. Azadi
Samyak Datta
Sanyuan Chen
Sean Bell
Sharadh Ramaswamy
Shelly Sheynin | Meta AI research | Generative models, Computer Vision, Deep Learning
Siddharth Bhattacharya
Simran Motwani
Tao Xu
Tianhe Li
Tingbo Hou | Google DeepMind | Computer Vision, Generative AI
Wei-Ning Hsu | Facebook AI Research (FAIR) | Speech Processing, Speech Synthesis, Audio Generation, Machine Learning
Xi Yin | Research Scientist, Facebook | Computer Vision, Machine Learning, Deep Learning
Xiaoliang Dai | Research Scientist, Meta GenAI | Generative AI, Computer vision
Yaniv Taigman | Meta | machine learning, computer vision
Yaqiao Luo
Yen-Cheng Liu | Research Scientist, Meta | Computer Vision, Machine Learning, Artificial Intelligence
Yi-Chiao Wu
Yue Zhao
Yuval Kirstain | GenAI, Meta | Natural Language Processing, Deep Learning
Zecheng He | Meta GenAI | Generative AI, Efficient Model, AI Security and Privacy
Zijian He
Albert Pumarola | Meta - Superintelligence Labs | Generative models
Ali K. Thabet
A. Sanakoyeu
Arun Mallya
Baishan Guo | Meta AI
Boris Araya
Breena Kerr
Carleigh Wood
Ce Liu | AI Research Scientist Director, Meta GenAI; IEEE Fellow | GenAI, computer vision, computer graphics, machine learning
Cen Peng
Dimitry Vengertsev
Edgar Schonfeld
Elliot Blanchard
Felix Juefei-Xu | Research Scientist, Meta Superintelligence Labs | Generative Models, Deep Learning, Computer Vision, AI Safety, Adversarial Robustness
Fraylie Nord
Jeff Liang
John Hoffman | Meta AI | Artificial Intelligence, Bayesian Statistics, Time-series Analysis
Jonas Kohler | ETH Zürich | GenAI, Non-convex Optimization, Machine Learning
Kaolin Fire
Karthik Sivakumar
Lawrence Chen
Licheng Yu
Luya Gao
Markos Georgopoulos | Research Scientist, Meta AI | Generative Models, Machine Learning, Computer Vision
Rashel Moritz
Sara K. Sampson
Shikai Li
Simone Parmeggiani
Steve Fine
Tara Fowler
Vladan Petrovic
Yuming Du | Meta - Superintelligence Labs | computer vision