Beyond Language Modeling: An Exploration of Multimodal Pretraining

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work explores the design space of natively multimodal foundation models, addressing how to integrate vision and language effectively beyond conventional language modeling. Building on the Transfusion framework, the authors propose a unified from-scratch pretraining approach that jointly leverages next-token prediction and diffusion-based generation, augmented with a Representation Autoencoder (RAE) that unifies visual representations for both understanding and generation. The study reveals the complementary nature and asymmetric scaling behavior of vision and language data: vision benefits more substantially from increased data volume. A Mixture-of-Experts (MoE) architecture is employed to enable efficient modality specialization and model expansion. Experiments demonstrate that unified pretraining naturally induces world-modeling capabilities and significantly improves performance on downstream tasks, laying a foundation for truly integrated multimodal foundation models.
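The joint objective described in the summary (next-token prediction for text plus a diffusion denoising loss for vision) can be sketched in miniature. This is a hypothetical pure-Python illustration, not the paper's implementation; the function names, the flat list inputs, and the balancing weight `lam` are all assumptions made for clarity.

```python
import math

def softmax_xent(logits, target):
    """Cross-entropy of one next-token prediction (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def mse(pred, true):
    """Mean squared error of a predicted noise vector (diffusion epsilon loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

def transfusion_step_loss(text_logits, text_targets, noise_pred, noise_true, lam=5.0):
    """Single-step joint loss: language cross-entropy plus weighted diffusion MSE.

    lam is a hypothetical coefficient balancing the two objectives.
    """
    lm_loss = sum(softmax_xent(l, t)
                  for l, t in zip(text_logits, text_targets)) / len(text_targets)
    diff_loss = mse(noise_pred, noise_true)
    return lm_loss + lam * diff_loss
```

For example, a single uniform two-way prediction contributes a cross-entropy of ln 2 ≈ 0.693, and the diffusion term is simply scaled MSE on the predicted noise.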

📝 Abstract
The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
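The IsoFLOP analysis mentioned in the abstract reduces to fitting power laws of the form N* = k · C^a to the compute-optimal token counts of each modality and comparing exponents. Below is a minimal log-log least-squares fit; the budgets, coefficients, and exponents are purely synthetic stand-ins for the paper's measurements, chosen only to illustrate the claimed asymmetry (a larger exponent means the modality is more data-hungry).

```python
import math

def fit_power_law(xs, ys):
    """Fit y = k * x**a by ordinary least squares in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    k = math.exp(my - a * mx)
    return k, a

# Synthetic compute budgets (FLOPs) and hypothetical compute-optimal
# token counts per modality; all numbers are made up for illustration.
budgets = [1e20, 1e21, 1e22, 1e23]
lang_tokens = [2.0 * c ** 0.50 for c in budgets]    # language: exponent 0.50
vision_tokens = [0.5 * c ** 0.62 for c in budgets]  # vision: exponent 0.62

_, a_lang = fit_power_law(budgets, lang_tokens)
_, a_vis = fit_power_law(budgets, vision_tokens)
# a_vis > a_lang: under these synthetic numbers, the optimal vision token
# count grows faster with compute, i.e. vision is more data-hungry.
```

The fit recovers the generating exponents exactly here because the synthetic points lie on perfect power laws; real IsoFLOP points would scatter around the fitted line.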
Problem

Research questions and friction points this paper is trying to address.

multimodal pretraining
foundation models
scaling laws
modality asymmetry
unified representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Pretraining
Representation Autoencoder
Mixture-of-Experts
Scaling Laws
World Modeling
🔎 Similar Papers
Shengbang Tong (NYU Courant): AI, Computer Vision, Deep Learning, Representation Learning
David Fan (Meta FAIR Labs): AI, Computer Vision, Deep Learning, Representation Learning
John Nguyen (Meta AI, FAIR (Fundamental AI Research)): Federated Learning, Natural Language Processing, Computer Vision, Artificial Intelligence
Ellis Brown (PhD, New York University): AI, Deep Learning, Computer Vision, Representation Learning
Gaoyue Zhou (FAIR, Meta; New York University)
Shengyi Qian (Research Scientist, Meta FAIR): Computer Vision, Vision Language Model, Robotics
Boyang Zheng (New York University): Computer Vision, Generative Models, Multi-modal learning
Théophane Vallaeys (Intern at Meta)
Junlin Han (Meta AI | University of Oxford): Computer vision, Machine Learning, Artificial Intelligence
Rob Fergus (Professor of Computer Science, New York University): Machine Learning, Protein Design
Naila Murray (Facebook AI Research): artificial intelligence, machine learning, computer vision
Marjan Ghazvininejad (Research Scientist, FAIR (Facebook AI Research)): Natural Language Processing, Machine Learning
Mike Lewis (Facebook AI Research): Natural language processing, machine learning, linguistics
Nicolas Ballas (Meta AI Research)
Amir Bar (Meta (FAIR)): Computer Vision, Artificial Intelligence, AI, Machine Learning, Deep Learning
Michael Rabbat (Research Scientist, FAIR at Meta): Self-Supervised Learning, Machine Learning, Optimization, Signal Processing, Distributed Computation
Jakob Verbeek (FAIR, Meta): Machine Learning, Computer Vision, Artificial Intelligence
Luke Zettlemoyer (University of Washington; Meta): Natural Language Processing, Semantics, Machine Learning, Artificial Intelligence
Koustuv Sinha (Research Scientist, Meta AI (Fundamental AI Research), McGill University (MSc, PhD)): language generation, language reasoning, graph neural networks, systematic generalization
Yann LeCun (Chief AI Scientist at Facebook & JT Schwarz Professor at the Courant Institute, New York University): AI, machine learning, computer vision, robotics, image compression
Saining Xie (Courant Institute, New York University): computer vision, machine learning, representation learning, artificial intelligence