🤖 AI Summary
To address the high cost and human-annotation dependence of instruction generation in Visual Instruction Tuning (VIT), which severely limits scalability, this paper proposes Pre-Instruction Data Selection (PreSel): a paradigm that decouples image selection from instruction generation by filtering images *before* any instructions are synthesized. PreSel first estimates the relative importance of each vision task to derive task-wise sampling budgets, then clusters image features within each task and selects the most representative unlabeled images under each budget. Generating instructions for only the selected 15% of images matches full-data fine-tuning performance on LLaVA-1.5 and Vision-Flan, substantially reducing both instruction-generation and training overhead. The authors present PreSel as the first VIT data selection framework that is task-aware, budget-controllable, and annotation-free, opening a path to efficient multimodal model adaptation under resource constraints.
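To make the budget step concrete, here is a minimal Python sketch. The task names, importance scores, and the proportional largest-remainder allocation rule are illustrative assumptions for exposition, not the paper's actual importance estimator.

```python
import numpy as np

def allocate_budgets(task_importance: dict, total_budget: int) -> dict:
    """Split a global sampling budget across tasks in proportion to
    estimated task importance (hypothetical proportional rule)."""
    tasks = list(task_importance)
    weights = np.array([task_importance[t] for t in tasks], dtype=float)
    weights /= weights.sum()                 # normalize scores to a distribution
    raw = weights * total_budget             # fractional per-task budgets
    budgets = np.floor(raw).astype(int)
    leftover = int(total_budget - budgets.sum())
    # Hand leftover slots to the tasks with the largest fractional remainders.
    for i in np.argsort(raw - budgets)[::-1][:leftover]:
        budgets[i] += 1
    return dict(zip(tasks, budgets.tolist()))

# Toy example: three hypothetical tasks, budget = 15% of 10,000 images.
print(allocate_budgets({"vqa": 0.5, "captioning": 0.3, "ocr": 0.2}, 1500))
# -> {'vqa': 750, 'captioning': 450, 'ocr': 300}
```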
📝 Abstract
Visual instruction tuning (VIT) for large vision-language models (LVLMs) requires training on expansive datasets of image-instruction pairs, which can be costly. Recent efforts in VIT data selection aim to select a small subset of high-quality image-instruction pairs, reducing VIT runtime while maintaining performance comparable to full-scale training. However, a major challenge often overlooked is that generating instructions from unlabeled images for VIT is highly expensive. Most existing VIT datasets rely heavily on human annotations or paid services like the GPT API, which limits users with constrained resources from creating VIT datasets for custom applications. To address this, we introduce Pre-Instruction Data Selection (PreSel), a more practical data selection paradigm that directly selects the most beneficial unlabeled images and generates instructions only for the selected images. PreSel first estimates the relative importance of each vision task within VIT datasets to derive task-wise sampling budgets. It then clusters image features within each task, selecting the most representative images within the budget. This approach reduces computational overhead for both instruction generation during VIT data formation and LVLM fine-tuning. By generating instructions for only 15% of the images, PreSel achieves performance comparable to full-data VIT on the LLaVA-1.5 and Vision-Flan datasets. Project page: https://bardisafa.github.io/PreSel
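As a companion to the abstract's second step, here is a minimal sketch of within-task representative selection via feature clustering. Running k-means and keeping the image nearest each centroid is an assumed stand-in for PreSel's actual representativeness criterion; `features` would come from a pretrained vision encoder, stubbed out below with random data.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representatives(features: np.ndarray, budget: int, seed: int = 0) -> np.ndarray:
    """Pick `budget` representative image indices from one task's features.

    Assumed scheme: cluster features into `budget` groups with k-means,
    then keep the image closest to each centroid.
    """
    km = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit(features)
    selected = []
    for c in range(budget):
        members = np.flatnonzero(km.labels_ == c)            # images in cluster c
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        selected.append(members[np.argmin(dists)])           # nearest to centroid
    return np.array(selected)

# Toy usage: 1,000 images with 512-d features, per-task budget of 50.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 512)).astype(np.float32)
print(select_representatives(feats, budget=50)[:10])
```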