MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning

📅 2025-03-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the ambiguity in defining high-quality multimodal data and the labor-intensive nature of manual curation in Vision Instruction Tuning (VIT), this paper proposes the first automated high-value data selection framework that jointly optimizes necessity and diversity. Methodologically, it computes a necessity score for each sample based on a seed model, employs KL-divergence-driven diversity sampling, and integrates a lightweight instruction-tuning闭环 evaluation loop. The core contribution lies in the first deep integration of quantitative necessity measurement with diversity-aware optimization, establishing a reusable, principled paradigm for multimodal data value assessment—surpassing conventional uniform or heuristic sampling strategies. Experiments demonstrate that models trained on <1% of the full dataset achieve competitive or superior performance over LLaVA-1.5 on several metrics, while using <50% of the data yields consistently state-of-the-art results across all benchmarks.

Technology Category

Application Category

📝 Abstract
Visual instruction tuning (VIT) has emerged as a crucial technique for enabling multi-modal large language models (MLLMs) to follow user instructions adeptly. Yet, a significant gap persists in understanding the attributes of high-quality instruction tuning data and frameworks for its automated selection. To address this, we introduce MLLM-Selector, an automated approach that identifies valuable data for VIT by weighing necessity and diversity. Our process starts by randomly sampling a subset from the VIT data pool to fine-tune a pretrained model, thus creating a seed model with an initial ability to follow instructions. Then, leveraging the seed model, we calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance. Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector, our methodology that fuses necessity scoring with strategic sampling for superior data refinement. Empirical results indicate that within identical experimental conditions, MLLM-Selector surpasses LLaVA-1.5 in some benchmarks with less than 1% of the data and consistently exceeds performance across all validated benchmarks when using less than 50%.
Problem

Research questions and friction points this paper is trying to address.

Identifying high-quality data for visual instruction tuning
Automating selection of necessary and diverse VIT data
Enhancing MLLM performance with minimal data usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated high-value data selection for VIT
Necessity and diversity-driven scoring method
Strategic sampling enhances model performance
🔎 Similar Papers
No similar papers found.
Yiwei Ma
Yiwei Ma
Stevens Institute of Technology
Guohai Xu
Guohai Xu
Xiaohongshu Inc., Alibaba DAMO Academy
MLLMAlignment
X
Xiaoshuai Sun
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, Fujian, P.R. China.
Jiayi Ji
Jiayi Ji
Rutgers University
Jie Lou
Jie Lou
Xiaohongshu
AlignmentRLHF
Debing Zhang
Debing Zhang
Xiaohongshu
Machine LearningComputer VisionDeep Learning
R
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, Fujian, P.R. China.