🤖 AI Summary
This work addresses the opacity and irreproducibility of post-training data strategies in open-source vision-language models (VLMs). We propose a systematic, decoupled, and publicly disclosed data-centric paradigm that explicitly separates and optimizes four core stages: data selection, composition balancing, cleaning, and scheduling. Methodologically, our approach integrates multi-source heterogeneous data distillation, curriculum-based sequencing, cross-task alignment augmentation, lightweight instruction tuning, and mixed-precision training. The resulting transparent and reproducible data strategy substantially improves both training efficiency and model performance. Across multiple mainstream multimodal benchmarks, our 9B-parameter Eagle2-9B achieves state-of-the-art results, matching competitive models of up to 70B parameters. This demonstrates that careful data engineering is pivotal to unlocking substantial performance gains in open-source VLMs.
📝 Abstract
Open-source vision-language models (VLMs) have recently made promising progress toward matching the capabilities of proprietary frontier models. However, most open-source efforts publish only their final model weights, leaving the critical details of data strategies and implementation largely opaque. In this work, we approach VLM post-training from a data-centric perspective, showing the key role of data strategy in developing frontier VLMs. By studying and building our post-training data strategy from scratch, we share detailed insights into the development process, aiming to benefit the open-source community in building competitive models. Our data strategy, together with our training recipes and model design, yields a family of performant VLMs named Eagle2. Specifically, Eagle2-9B achieves state-of-the-art results across various multimodal benchmarks, matching certain competitive models with up to 70B parameters.