NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

📅 2025-10-09
🤖 AI Summary
Existing multimodal large language models (MLLMs) predominantly adopt a compositional paradigm—separately pre-training a vision encoder and a language model before connecting them—which hinders systematic investigation of multimodal co-scaling laws. Method: The authors conduct an end-to-end native training study under data-constrained conditions, comprehensively exploring the MLLM design space and its scaling behavior. They identify a positive correlation between the optimal sizes of the vision encoder and the language model, and accordingly propose a lightweight cross-modal connector together with a data-efficient training strategy. Contribution/Results: The approach achieves performance on par with state-of-the-art compositional models across 14 mainstream multimodal benchmarks while significantly reducing training cost, suggesting that native end-to-end training offers a favorable trade-off among scalability, efficiency, and performance for future MLLM development.

📝 Abstract
Compositional training has been the de facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained vision encoders are connected to pre-trained LLMs through continued multimodal pre-training. However, the multimodal scaling properties of this paradigm remain difficult to explore because the components are trained separately. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling properties under a practical setting, i.e., data constraints. Through a careful study of various design choices, we obtain the optimal meta-architecture that best balances performance and training cost. We then further explore the scaling properties of native MLLMs and identify a positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, together with a simple and cost-effective training recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Beyond this, our findings provide in-depth insights for future study of native MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Studying native multimodal model scaling under data constraints
Optimizing architecture balance between performance and training cost
Exploring scaling relationship between visual encoders and language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Native end-to-end training for MLLMs
Optimal meta-architecture balancing performance and cost
Scaling relationship between visual encoders and LLMs
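The "positively correlated scaling relationship" can be made concrete with a toy sketch: under a fixed data budget, the compute-optimal vision encoder size grows with the LLM size rather than staying fixed. The power-law form below, and the constants `alpha` and `c`, are illustrative assumptions for exposition only, not fitted values from the paper.

```python
def optimal_vision_params(llm_params: float, alpha: float = 0.8, c: float = 2.5) -> float:
    """Hypothetical power law linking the compute-optimal vision encoder size
    to the LLM size: N_vis = c * N_llm ** alpha.

    `alpha` and `c` are illustrative placeholders; the paper's finding is only
    that the relationship is positively correlated, not this exact form.
    """
    return c * llm_params ** alpha

# A positive correlation means larger LLMs warrant larger vision encoders,
# in contrast to compositional MLLMs that reuse one fixed-size encoder.
for n_llm in (1e9, 7e9, 70e9):
    n_vis = optimal_vision_params(n_llm)
    print(f"LLM {n_llm:.0e} params -> vision encoder ~{n_vis:.2e} params")
```

The contrast with compositional training is the point: pairing a frozen, fixed-size encoder (e.g. one CLIP checkpoint) with ever-larger LLMs would leave the vision side increasingly undersized under this kind of relationship.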