LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the high computational cost, poor reproducibility, and data bias inherent in multimodal model training. We introduce LLaVA-OneVision-1.5, an open-source framework featuring: (1) a concept-balanced data curation methodology and an offline parallel data packing strategy that improve data quality and I/O efficiency; (2) an end-to-end reproducible training pipeline enabling from-scratch vision-language model training; and (3) an efficient training paradigm leveraging 85M pretraining and 26M instruction-tuning samples, totaling 64B compressed multimodal tokens. The entire training process completes within a $16,000 budget, substantially lowering entry barriers. Evaluated across 27 benchmarks, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of them, while LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27, achieving state-of-the-art performance.

📝 Abstract
We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Unlike existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset, LLaVA-OneVision-1.5-Mid-Training, and a meticulously curated 26M instruction dataset, LLaVA-OneVision-1.5-Instruct, collectively encompassing 64B compressed multimodal tokens. (2) Efficient Training Framework: We develop a complete end-to-end training framework leveraging an offline parallel data packing strategy, which enables training LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-Art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.
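
The abstract credits much of the $16,000 budget to the offline parallel data packing strategy but does not detail it. As a minimal sketch, assuming a standard first-fit-decreasing bin packing over per-sample token counts (the function name, 4096-token sequence length, and example lengths below are illustrative assumptions, not the authors' released code):

```python
def pack_samples(sample_lengths: list[int], max_len: int) -> list[list[int]]:
    """Greedy first-fit-decreasing packing: group variable-length samples
    into bins whose total token count stays within max_len, so training
    sequences arrive nearly full and padding is minimized.

    Returns a list of bins, each a list of sample indices.
    """
    # Longest-first order lets large samples claim bins early.
    order = sorted(range(len(sample_lengths)), key=lambda i: -sample_lengths[i])
    bins: list[list[int]] = []
    free: list[int] = []  # remaining token capacity of each bin
    for idx in order:
        length = sample_lengths[idx]
        for b in range(len(bins)):
            if length <= free[b]:
                bins[b].append(idx)
                free[b] -= length
                break
        else:  # no existing bin fits: open a new one
            bins.append([idx])
            free.append(max_len - length)
    return bins

# Example: pack multimodal samples (token counts) into 4096-token sequences.
print(pack_samples([1200, 3000, 800, 2500, 600, 4000], max_len=4096))
```

Because the assignment is computed once before training, and independently per dataset shard, the packing itself parallelizes trivially and removes padding and batching overhead from the training loop's I/O path.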
Problem

Research questions and friction points this paper is trying to address.

Democratizing multimodal training with open frameworks
Reducing computational and financial costs for LMMs
Building high-quality vision-language models from scratch
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open framework for multimodal training from scratch
Efficient training with offline parallel data packing
Large-scale curated datasets balanced at the concept level (see the sketch below)
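
The summary and abstract both highlight concept balancing without specifying the method. A minimal sketch of one common approach, inverse-frequency sampling over concept labels (the labels, weighting scheme, and function below are assumptions for illustration, not the released curation pipeline):

```python
import random
from collections import Counter

def concept_balanced_sample(concepts: list[str], k: int, seed: int = 0) -> list[int]:
    """Draw k sample indices with probability inversely proportional to the
    frequency of each sample's concept, so rare concepts are not drowned
    out by head concepts during pretraining-data selection."""
    counts = Counter(concepts)
    weights = [1.0 / counts[c] for c in concepts]  # rarer concept => larger weight
    rng = random.Random(seed)  # fixed seed keeps the curation reproducible
    return rng.choices(range(len(concepts)), weights=weights, k=k)

# Example: 'dog' dominates the raw pool, but the balanced draw mixes concepts.
labels = ["dog"] * 8 + ["x-ray"] * 2 + ["chart"]
picked = concept_balanced_sample(labels, k=6)
print([labels[i] for i in picked])
```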
Authors

Xiang An
DeepGlint
Computer Vision

Yin Xie
LLaVA-OneVision Community Contributors

Kaicheng Yang
DeepGlint
Multimodal, CV, NLP

Wenkang Zhang
Shanghai Jiao Tong University
3D Vision, Embodied AI, World Model, Learning-based Compression

Xiuwei Zhao
LLaVA-OneVision Community Contributors

Zheng Cheng
LLaVA-OneVision Community Contributors

Yirui Wang
Amazon
Object Detection, Tracking, Medical Image Analysis, Computer-aided Diagnosis

Songcen Xu
LLaVA-OneVision Community Contributors

Changrui Chen
LLaVA-OneVision Community Contributors

Chunsheng Wu
LLaVA-OneVision Community Contributors

Huajie Tan
Peking University
Embodied AI, Foundation Models

Chunyuan Li
xAI
Deep Learning, Vision, Language, Multimodal

Jing Yang
LLaVA-OneVision Community Contributors

Jie Yu
LLaVA-OneVision Community Contributors

Xiyao Wang
Ph.D., University of Maryland, College Park
World Model, Embodied AI, Multimodal LLM

Bin Qin
Institute of Software, Chinese Academy of Sciences
Machine Learning, Causal Inference

Yumeng Wang
LLaVA-OneVision Community Contributors

Zizhen Yan
LLaVA-OneVision Community Contributors

Ziyong Feng
LLaVA-OneVision Community Contributors

Ziwei Liu
Associate Professor, Nanyang Technological University
Computer Vision, Machine Learning, Computer Graphics

Bo Li
LLaVA-OneVision Community Contributors

Jiankang Deng
Imperial College London
Computer Vision, Machine Learning