🤖 AI Summary
This work addresses the opacity and irreproducibility of post-training data strategies in open-source vision-language models (VLMs). We propose a systematic, decoupled, and publicly disclosed data-centric paradigm that explicitly separates and optimizes four core stages: data selection, composition balancing, cleaning, and scheduling. Methodologically, our approach integrates multi-source heterogeneous data distillation, curriculum-based sequencing, cross-task alignment augmentation, lightweight instruction tuning, and mixed-precision training. The resulting transparent and reproducible data strategy substantially improves both training efficiency and model performance. Across multiple mainstream multimodal benchmarks, our 9B-parameter Eagle2-9B achieves state-of-the-art results, matching competitive models of up to 70B parameters. This demonstrates that careful data engineering is pivotal to unlocking substantial performance gains in open-source VLMs.
📝 Abstract
Open-source vision-language models (VLMs) have recently made promising progress toward matching the capabilities of proprietary frontier models. However, most open-source efforts publish only their final model weights, leaving the critical details of data strategies and implementation largely opaque. In this work, we approach VLM post-training from a data-centric perspective, showing the key role of data strategy in developing frontier VLMs. By studying and building our post-training data strategy from scratch, we share detailed insights into the development process, aiming to benefit the open-source community in building competitive models. Our data strategy, together with our training recipes and model design, yields a family of performant VLMs named Eagle2. Specifically, Eagle2-9B achieves state-of-the-art results across various multimodal benchmarks, matching certain competitive models with up to 70B parameters.