Scaling Laws for Native Multimodal Models Scaling Laws for Native Multimodal Models

📅 2025-04-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work challenges the prevailing assumption that late-fusion native multimodal models (NMMs)—requiring separate visual encoders—are inherently superior, investigating the architectural advantages of simpler alternatives. Method: Through systematic scaling experiments across 457 models, we comparatively evaluate early- and late-fusion paradigms; we further propose a Mixture-of-Experts (MoE)-enhanced early-fusion architecture enabling modality-adaptive weight learning. Contribution/Results: We demonstrate that early-fusion NMMs significantly outperform late-fusion counterparts under low-parameter regimes: eliminating the image encoder reduces parameter count, improves training efficiency by 37%, and lowers inference latency by 29%. The MoE-augmented variant further boosts average accuracy by 5.2%. Collectively, our findings establish end-to-end early fusion as a more efficient and scalable NMM paradigm, providing both theoretical grounding and practical design principles for next-generation multimodal foundation models.

Technology Category

Application Category

📝 Abstract
Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders to LLMs and continuing multimodal training. While such approaches exhibit remarkable sample efficiency, it remains an open question whether such late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs)--those trained from the ground up on all modalities--and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones, which do not rely on image encoders. On the contrary, early-fusion exhibits stronger performance at lower parameter counts, is more efficient to train, and is easier to deploy. Motivated by the strong performance of the early-fusion architectures, we show that incorporating Mixture of Experts (MoEs) allows for models that learn modality-specific weights, significantly enhancing performance.
Problem

Research questions and friction points this paper is trying to address.

Investigating superiority of late-fusion vs early-fusion multimodal architectures
Analyzing scaling laws for 457 native multimodal models (NMMs)
Enhancing performance via Mixture of Experts (MoEs) in early-fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early-fusion architectures outperform late-fusion designs
Mixture of Experts enhances modality-specific learning
Scaling study with 457 models guides architectural choice
🔎 Similar Papers
No similar papers found.