Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning

📅 2025-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the instability, poor generalizability, and high cost of scaling high-quality instruction-data selection for multimodal large language model (MLLM) instruction tuning, this paper proposes a training-agnostic, budget-scalable data-selection framework. Methodologically, it introduces: (1) a quality metric decomposed into 14 vision–language capabilities; (2) multimodal “rich scorers” and a “rich styler” that promote diversity of interaction styles; and (3) lightweight, embedding-free ranking coupled with budget-aware dynamic sampling. Evaluated on 14 cross-domain benchmarks, the framework achieves 99.1% of full-dataset (2.6M samples) performance using only 30% of the data (780K samples), substantially outperforming both random sampling and state-of-the-art selection methods.
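The summary above describes a score-then-diversify pipeline: rank candidates by capability scores, then spread a fixed budget across interaction styles. A minimal sketch of that idea (not the authors' implementation; the `scores`/`style` fields and the proportional-quota rule are illustrative assumptions):

```python
from collections import defaultdict

def select_by_score_and_style(samples, budget):
    """Illustrative sketch of score-based, style-diversified selection.

    Each sample is assumed to be a dict with:
      - 'scores': per-capability quality scores (e.g. 14 values)
      - 'style':  an interaction-style label assigned by a styler
    """
    # Group candidates by interaction style.
    by_style = defaultdict(list)
    for s in samples:
        by_style[s["style"]].append(s)

    # Split the budget across styles in proportion to group size,
    # then keep the top-scoring samples within each group.
    selected = []
    total = len(samples)
    for group in by_style.values():
        quota = max(1, round(budget * len(group) / total))
        group.sort(key=lambda s: sum(s["scores"]), reverse=True)
        selected.extend(group[:quota])
    return selected[:budget]
```

Because selection relies only on precomputed scores and style labels, no embedding clustering or greedy pairwise comparison is needed, which is what lets such a scheme scale to millions of candidates under arbitrary budgets.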

📝 Abstract
The hypothesis that pretrained large language models (LLMs) necessitate only minimal supervision during the supervised fine-tuning (SFT) stage (Zhou et al., 2024) has been substantiated by recent advancements in data curation and selection research. However, their stability and generalizability are compromised due to the vulnerability to experimental setups and validation protocols, falling short of surpassing random sampling (Diddee & Ippolito, 2024; Xia et al., 2024b). Built upon LLMs, multi-modal LLMs (MLLMs), combined with the sheer token volume and heightened heterogeneity of data sources, amplify both the significance and complexity of data selection. To harvest multi-modal instructional data in a robust and efficient manner, we re-define the granularity of the quality metric by decomposing it into 14 vision-language-related capabilities, and introduce multi-modal rich scorers to evaluate the capabilities of each data candidate. To promote diversity, in light of the inherent objective of the alignment stage, we take interaction style as the diversity indicator and use a multi-modal rich styler to identify data instruction patterns. In doing so, our multi-modal rich scorers and styler (mmSSR) guarantee that high-scoring information is conveyed to users in diversified forms. Free from embedding-based clustering or greedy sampling, mmSSR efficiently scales to millions of data points under varying budget constraints, supports customization for general or specific capability acquisition, and facilitates training-free generalization to new domains for curation. Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods, achieving 99.1% of full performance with only 30% of the 2.6M data.
Problem

Research questions and friction points this paper is trying to address.

Existing instruction-data selection methods are unstable and sensitive to experimental setups, often failing to beat random sampling.
Multimodal instruction tuning magnifies the difficulty: far larger token volumes and more heterogeneous data sources.
Embedding-based clustering and greedy sampling scale poorly to millions of candidates under varying budget constraints.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes quality into 14 vision-language capabilities.
Uses multi-modal rich scorers and a rich styler (mmSSR) to pair high scores with diverse interaction styles.
Scales efficiently to millions of data points under varying budget constraints.