Curriculum Learning with Quality-Driven Data Selection

📅 2024-06-27
🏛️ arXiv.org
📈 Citations: 2
Influential: 0

🤖 AI Summary
To address the lack of controllable quality assessment for instruction data in visual instruction tuning of multimodal large language models (MLLMs), this paper introduces a two-dimensional quality space defined by image-text correlation and model perplexity, enabling interpretable scoring and tiered filtering of data. The method combines a cross-modal alignment signal with the language model's own uncertainty estimate, supports analysis of how task-type prompts affect the quality distribution, and constructs multi-stage subsets of varying quality for curriculum learning. Empirically, training on the selected data yields substantial gains over the full-dataset baseline in five commonly assessed capabilities. All code, data, and models are publicly released.

📝 Abstract
The impressive multimodal capabilities demonstrated by OpenAI's GPT-4 have generated significant interest in the development of Multimodal Large Language Models (MLLMs). Visual instruction tuning of MLLMs with machine-generated instruction-following data has been shown to enhance zero-shot capabilities across various tasks. However, there has been limited exploration into controlling the quality of the instruction data. Current methodologies for data selection in MLLMs often rely on single, unreliable scores or use downstream tasks for selection, which is time-consuming and can lead to overfitting on the chosen evaluation datasets. To mitigate these limitations, we propose a novel data selection methodology that uses image-text correlation and model perplexity to evaluate and select data of varying quality. This approach leverages the distinct distributions of these two attributes, mapping data quality into a two-dimensional space that allows data to be selected based on their location within this distribution. Using this space, we can analyze the impact of task type settings, used as prompts, on data quality. Additionally, this space can be used to construct multi-stage subsets of varying quality to facilitate curriculum learning. Our research includes comprehensive experiments conducted on various datasets. The results show substantial improvements in five commonly assessed capabilities compared to using the complete dataset. Our code, data, and models are publicly available at: https://anonymous.4open.science/r/EHIT-31B4
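The abstract does not say how model perplexity is computed. Under the standard definition, perplexity is the exponential of the mean negative log-likelihood per token. The following is a minimal sketch under that assumption; the function name `perplexity` is hypothetical, and the per-token natural-log probabilities are assumed to come from whichever language model is being scored.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Standard perplexity: exp of the mean negative log-likelihood.

    `token_log_probs` holds natural-log probabilities the model assigned
    to each target token (an assumed input format, not taken from the paper).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.5 to every token has perplexity 2.
example = perplexity([math.log(0.5)] * 4)
```

A lower value means the model found the text more predictable. Whether low or high perplexity marks the more useful training samples is a choice made by the selection method, not by this formula.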
Problem

Research questions and friction points this paper is trying to address.

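Limited control over the quality of instruction data used to tune MLLMs
Existing data selection relies on single, unreliable scores or on downstream tasks, which is time-consuming and risks overfitting
How to improve zero-shot capabilities by training on quality-staged subsets (curriculum learning)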
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses image-text correlation for data selection
Employs model perplexity to assess data quality
Constructs multi-stage subsets for curriculum learning
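
The three ideas above can be illustrated with a small sketch that places each sample in a two-dimensional (correlation, perplexity) space and splits the data into ordered stages. This is not the paper's procedure. The names `Sample`, `rank_fractions`, and `build_stages` are hypothetical, and the sketch assumes, purely for illustration, that higher correlation and lower perplexity indicate higher quality. The actual selection regions used by the authors may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    sample_id: str
    correlation: float  # image-text correlation score (assumed: higher = better aligned)
    perplexity: float   # model perplexity on the text (assumed: lower = better)


def rank_fractions(values: List[float], reverse: bool = False) -> List[float]:
    """Map each value to its rank scaled into [0, 1].

    With reverse=False the smallest value gets 0.0; with reverse=True
    the largest value gets 0.0.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    ranks = [0.0] * len(values)
    denom = max(len(values) - 1, 1)
    for position, index in enumerate(order):
        ranks[index] = position / denom
    return ranks


def build_stages(samples: List[Sample], num_stages: int = 3) -> List[List[Sample]]:
    """Split samples into stages of increasing (assumed) quality."""
    corr_rank = rank_fractions([s.correlation for s in samples])
    # reverse=True so that the lowest perplexity receives the highest rank.
    ppl_rank = rank_fractions([s.perplexity for s in samples], reverse=True)
    stages: List[List[Sample]] = [[] for _ in range(num_stages)]
    for sample, c, p in zip(samples, corr_rank, ppl_rank):
        quality = (c + p) / 2  # simple average of the two axes (an assumption)
        stage = min(int(quality * num_stages), num_stages - 1)
        stages[stage].append(sample)
    return stages


samples = [
    Sample("a", correlation=0.1, perplexity=50.0),
    Sample("b", correlation=0.5, perplexity=20.0),
    Sample("c", correlation=0.9, perplexity=5.0),
]
stages = build_stages(samples, num_stages=3)
```

Training would then visit `stages[0]`, `stages[1]`, and so on in order. Rank-based scaling is used here only so that the two axes are comparable without knowing their raw ranges.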