🤖 AI Summary
Visual instruction fine-tuning (VIF) faces challenges in scaling to large data volumes, in ensuring image-text quality and alignment, and in the underexplored problem of unsupervised data selection. This paper proposes an efficient unsupervised data selection and augmentation framework. First, it identifies salient visual regions via high-response areas in the model's self-attention maps; it then masks the corresponding hidden states and quantifies the quality of each image-text pair by the loss change (Δ) before and after masking. The score requires no annotations, auxiliary models, or additional training, and generalizes across models and datasets. Contrastive learning is further incorporated to refine the selection strategy. Experiments demonstrate that, using only 20% of the data, training is roughly five times faster while accuracy surpasses the full-data baseline by 10.1%, achieving state-of-the-art performance across multiple vision-language models and datasets.
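The scoring step described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `toy_loss`, the random hidden states, the per-token attention vector, and the 30% mask ratio are all stand-in assumptions for a real VLM's training loss, hidden states, attention maps, and masking hyperparameter. The sketch only shows the core idea of ranking samples by Δ = (loss after masking high-attention hidden states) − (original loss) and keeping the top 20%:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_loss(hidden, targets, W):
    # Stand-in for the VLM's loss: MSE of a linear readout over hidden states.
    preds = hidden @ W
    return float(np.mean((preds - targets) ** 2))

def delta_score(hidden, attn, targets, W, mask_ratio=0.3):
    """Δ-score one image-text pair: loss change after masking the hidden
    states at the highest-attention token positions (hypothetical sketch)."""
    base = toy_loss(hidden, targets, W)
    k = max(1, int(mask_ratio * len(attn)))
    top = np.argsort(attn)[-k:]        # positions with highest attention
    masked = hidden.copy()
    masked[top] = 0.0                  # zero out the salient hidden states
    return toy_loss(masked, targets, W) - base  # Δ = masked − original loss

# Score a toy batch and keep the top 20% of samples by Δ.
n_samples, seq_len, dim = 50, 16, 8
W = rng.normal(size=(dim, 4))
scores = []
for _ in range(n_samples):
    hidden = rng.normal(size=(seq_len, dim))   # stand-in hidden states
    attn = rng.random(seq_len)                 # stand-in self-attention row
    targets = rng.normal(size=(seq_len, 4))
    scores.append(delta_score(hidden, attn, targets, W))
keep = np.argsort(scores)[-n_samples // 5:]    # indices of the top 20%
print(len(keep))  # 10
```

Ranking by larger Δ assumes that a sample whose loss degrades sharply when its salient visual evidence is masked is one where the text genuinely depends on the image, which is the intuition the summary attributes to the method.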
📝 Abstract
Visual Instruction Finetuning (VIF) is pivotal for post-training Vision-Language Models (VLMs). Unlike unimodal instruction finetuning of plain-text large language models, which mainly requires instruction datasets to instill instruction-following ability, VIF also requires multimodal data to enable joint visual and textual understanding, and therefore typically demands more data. Consequently, VIF imposes stricter data selection challenges: the method must scale efficiently to larger data volumes while ensuring the quality of both visual and textual content, as well as their alignment. Despite its critical impact on performance, data selection for VIF remains understudied. In this paper, we propose $\Delta$-AttnMask, a data-efficient framework that quantifies sample quality through attention-guided masking of the model's hidden states, jointly evaluating image-text pairs without requiring domain labels, auxiliary models, or extra training. By computing the loss difference ($\Delta$) between the original hidden states and states masked at high-attention regions, $\Delta$-AttnMask intrinsically assesses sample quality. Experiments across multiple VLMs and datasets show that $\Delta$-AttnMask achieves state-of-the-art performance with just 20% of the data, accelerating training by 5x while surpassing full-dataset baselines by +10.1% in overall accuracy. Its model-agnostic and data-agnostic design ensures broad applicability across modalities and architectures.