LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the high computational cost, poor reproducibility, and data bias inherent in multimodal model training. We introduce LLaVA-OneVision-1.5, an open-source framework featuring: (1) a concept-balanced data curation methodology and an offline parallel data packing strategy that improve data quality and I/O efficiency; (2) an end-to-end reproducible training pipeline enabling from-scratch vision-language model training; and (3) an efficient training paradigm leveraging 85M pretraining and 26M instruction-tuning samples, totaling 64B compressed multimodal tokens. The entire training process completes within a $16,000 budget, substantially lowering entry barriers. Evaluated across 27 benchmarks, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of them, while LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27, achieving state-of-the-art performance.

📝 Abstract
We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Unlike existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset, LLaVA-OneVision-1.5-Mid-Training, and a meticulously curated 26M instruction dataset, LLaVA-OneVision-1.5-Instruct, collectively encompassing 64B compressed multimodal tokens. (2) Efficient Training Framework: We develop a complete end-to-end training framework leveraging an offline parallel data packing strategy, which enables training LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-Art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.
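
The abstract credits much of the $16,000 budget to the offline parallel data packing strategy but does not detail it. As a minimal sketch, assuming a standard first-fit-decreasing bin packing over per-sample token counts (the function name, 4096-token sequence length, and example lengths below are illustrative assumptions, not the authors' released code):

```python
def pack_samples(sample_lengths: list[int], max_len: int) -> list[list[int]]:
    """Greedy first-fit-decreasing packing: group variable-length samples
    into bins whose total token count stays within max_len, so training
    sequences arrive nearly full and padding is minimized.

    Returns a list of bins, each a list of sample indices.
    """
    # Longest-first order lets large samples claim bins early.
    order = sorted(range(len(sample_lengths)), key=lambda i: -sample_lengths[i])
    bins: list[list[int]] = []
    free: list[int] = []  # remaining token capacity of each bin
    for idx in order:
        length = sample_lengths[idx]
        for b in range(len(bins)):
            if length <= free[b]:
                bins[b].append(idx)
                free[b] -= length
                break
        else:  # no existing bin fits: open a new one
            bins.append([idx])
            free.append(max_len - length)
    return bins

# Example: pack multimodal samples (token counts) into 4096-token sequences.
print(pack_samples([1200, 3000, 800, 2500, 600, 4000], max_len=4096))
```

Because the assignment is computed once before training, and independently per dataset shard, the packing itself parallelizes trivially and removes padding and batching overhead from the training loop's I/O path.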
Problem

Research questions and friction points this paper is trying to address.

Democratizing multimodal training with open frameworks
Reducing computational and financial costs for LMMs
Building high-quality vision-language models from scratch
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open framework for multimodal training from scratch
Efficient training with offline parallel data packing
Large-scale curated datasets balanced at the concept level (see the sketch below)
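
The summary and abstract both highlight concept balancing without specifying the method. A minimal sketch of one common approach, inverse-frequency sampling over concept labels (the labels, weighting scheme, and function below are assumptions for illustration, not the released curation pipeline):

```python
import random
from collections import Counter

def concept_balanced_sample(concepts: list[str], k: int, seed: int = 0) -> list[int]:
    """Draw k sample indices with probability inversely proportional to the
    frequency of each sample's concept, so rare concepts are not drowned
    out by head concepts during pretraining-data selection."""
    counts = Counter(concepts)
    weights = [1.0 / counts[c] for c in concepts]  # rarer concept => larger weight
    rng = random.Random(seed)  # fixed seed keeps the curation reproducible
    return rng.choices(range(len(concepts)), weights=weights, k=k)

# Example: 'dog' dominates the raw pool, but the balanced draw mixes concepts.
labels = ["dog"] * 8 + ["x-ray"] * 2 + ["chart"]
picked = concept_balanced_sample(labels, k=6)
print([labels[i] for i in picked])
```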
Authors

Xiang An
DeepGlint
Computer Vision

Yin Xie
LLaVA-OneVision Community Contributors

Kaicheng Yang
DeepGlint
Multimodal, CV, NLP

Wenkang Zhang
Shanghai Jiao Tong University
3D Vision, Embodied AI, World Model, Learning-based Compression

Xiuwei Zhao
LLaVA-OneVision Community Contributors

Zheng Cheng
LLaVA-OneVision Community Contributors

Yirui Wang
Amazon
Object Detection, Tracking, Medical Image Analysis, Computer-aided Diagnosis

Songcen Xu
LLaVA-OneVision Community Contributors

Changrui Chen
LLaVA-OneVision Community Contributors

Chunsheng Wu
LLaVA-OneVision Community Contributors

Huajie Tan
Peking University
Embodied AI, Foundation Models

Chunyuan Li
xAI
Deep Learning, Vision, Language, Multimodal

Jing Yang
LLaVA-OneVision Community Contributors

Jie Yu
LLaVA-OneVision Community Contributors

Xiyao Wang
Ph.D., University of Maryland, College Park
World Model, Embodied AI, Multimodal LLM

Bin Qin
Institute of Software, Chinese Academy of Sciences
Machine Learning, Causal Inference

Yumeng Wang
LLaVA-OneVision Community Contributors

Zizhen Yan
LLaVA-OneVision Community Contributors

Ziyong Feng
LLaVA-OneVision Community Contributors

Ziwei Liu
Associate Professor, Nanyang Technological University
Computer Vision, Machine Learning, Computer Graphics

Bo Li
LLaVA-OneVision Community Contributors

Jiankang Deng
Imperial College London
Computer Vision, Machine Learning