🤖 AI Summary
This study addresses the high computational cost, poor reproducibility, and data bias inherent in multimodal model training. We introduce LLaVA-OneVision-1.5, an open-source framework featuring: (1) a concept-balanced data curation methodology and an offline parallel data packing strategy to enhance data quality and I/O efficiency; (2) an end-to-end reproducible training pipeline enabling from-scratch vision-language model training; and (3) an efficient training paradigm leveraging 85M pretraining and 26M instruction-tuning samples, totaling 64B compressed multimodal tokens. The entire training process is completed within a $16,000 budget, substantially lowering entry barriers. Evaluated across 27 benchmarks, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of them, while LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27, achieving state-of-the-art performance.
📝 Abstract
We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieves state-of-the-art performance at significantly reduced computational and financial cost. Unlike existing work, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset, LLaVA-OneVision-1.5-Mid-Training, and a meticulously curated 26M instruction dataset, LLaVA-OneVision-1.5-Instruct, collectively encompassing 64B compressed multimodal tokens. (2) Efficient Training Framework: We develop a complete end-to-end training framework leveraging an offline parallel data packing strategy, enabling LLaVA-OneVision-1.5 to be trained within a $16,000 budget. (3) State-of-the-Art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 delivers highly competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.
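The abstract credits much of the training efficiency to an offline parallel data packing strategy. The paper text here does not specify the exact algorithm, but the general idea behind such packing is to concatenate variable-length samples into fixed-capacity sequences ahead of time, so GPU batches carry few padding tokens. The sketch below is a hypothetical illustration using a greedy first-fit-decreasing pass; the function name, capacity value, and algorithm choice are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of offline data packing: group variable-length samples
# into sequences of at most `capacity` tokens to minimize padding waste.
# This greedy first-fit-decreasing heuristic is an assumption for illustration;
# the actual LLaVA-OneVision-1.5 packing strategy may differ.
def pack_samples(sample_lengths, capacity=8192):
    """Return groups of sample indices whose token counts fit in `capacity`."""
    bins = []  # each bin: [remaining_capacity, [sample indices]]
    # Place longer samples first so short ones can fill leftover space.
    order = sorted(range(len(sample_lengths)), key=lambda i: -sample_lengths[i])
    for i in order:
        n = sample_lengths[i]
        for b in bins:
            if b[0] >= n:          # first bin with enough room
                b[0] -= n
                b[1].append(i)
                break
        else:                       # no bin fits: open a new sequence
            bins.append([capacity - n, [i]])
    return [indices for _, indices in bins]

# Five samples pack into two full-capacity sequences instead of five padded ones.
packed = pack_samples([5000, 4000, 3000, 2000, 500], capacity=8192)
print(len(packed))  # → 2
```

In an offline setting this pass runs once over the whole corpus (and can be parallelized across shards), so the training loop simply streams pre-packed, near-full sequences.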