Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning

📅 2026-05-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses two obstacles in multimodal instruction tuning: instruction data are highly redundant, and existing data selection methods are tied to a specific model or dataset, so they generalize poorly and incur high computational cost. The authors propose OFA, a framework that clusters instruction data in a frozen CLIP embedding space to generate pseudo-labels and then trains a lightweight, reusable selector on them. The samples on which the selector is least confident are selected as the most informative for training. This yields a “train-once, reuse-anywhere” data selection mechanism, presented as the first of its kind, that decouples selection from the target model. In experiments, OFA reaches 98.3% of full-data performance using only 15% of the data, and on the unseen Vision-Flan-186K dataset the transferred selector surpasses full-data training by 10.6%; the selected subsets are effective for both Qwen2.5-VL-3B and LLaVA-v1.5-7B.
📝 Abstract
Multimodal instruction tuning is the de facto recipe for adapting vision-language models (VLMs), yet instruction data are highly redundant, making data selection critical for training efficiency. Existing methods derive selection signals from a specific model or dataset, so whenever the target model or candidate pool changes, the criteria must be recomputed from scratch at substantial cost. To address this, we propose OFA, a data selection framework that trains a reusable selector once and applies it to any dataset or model without recomputation. OFA clusters multimodal instructions in a frozen CLIP space, derives pseudo-labels from the cluster structure, and trains a lightweight selector for only a few epochs; samples on which this selector is least confident are selected as the most informative. Once trained, the frozen selector transfers directly across datasets and model scales. The selector is trained once on LLaVA-665K and applied both to LLaVA-665K itself and, without any retraining, to the unseen Vision-Flan-186K. Selecting only 15% of the data, OFA achieves 98.3% of full-data performance across 10 downstream benchmarks; on the smaller Vision-Flan-186K, the transferred selector surpasses full-data training by 10.6%, confirming that the learned signal generalizes to datasets never seen during selector training. The same selected subsets benefit both Qwen2.5-VL-3B and LLaVA-v1.5-7B without per-model recomputation, decoupling selection from the target model. These results demonstrate that a single, transferable selector provides an effective and reusable solution for efficient multimodal instruction tuning.
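The pipeline described in the abstract (cluster, derive pseudo-labels, train a small selector, keep the least-confident samples) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes synthetic 2-D points in place of frozen CLIP embeddings, plain k-means for clustering, a softmax linear classifier as the "lightweight selector", and the maximum softmax probability as the confidence score. All function names (`kmeans`, `train_selector`, `select_least_confident`) and hyperparameters are hypothetical.

```python
import math
import random

def kmeans(points, k, iters=10, seed=0):
    """Plain k-means; returns one cluster id (used as a pseudo-label) per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, b, x):
    return [sum(w * v for w, v in zip(Wc, x)) + bc for Wc, bc in zip(W, b)]

def train_selector(points, labels, k, epochs=3, lr=0.5):
    """Assumed selector: softmax regression trained with SGD for a few epochs."""
    dim = len(points[0])
    W = [[0.0] * dim for _ in range(k)]
    b = [0.0] * k
    for _ in range(epochs):
        for x, y in zip(points, labels):
            p = softmax(logits(W, b, x))
            for c in range(k):
                g = p[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                b[c] -= lr * g
                W[c] = [w - lr * g * v for w, v in zip(W[c], x)]
    return W, b

def confidence(W, b, x):
    """Assumed confidence score: maximum softmax probability."""
    return max(softmax(logits(W, b, x)))

def select_least_confident(W, b, points, ratio=0.15):
    """Return indices of the `ratio` fraction of points with the lowest confidence."""
    budget = max(1, int(len(points) * ratio))
    order = sorted(range(len(points)), key=lambda i: confidence(W, b, points[i]))
    return order[:budget]

# Synthetic stand-ins for embeddings of the selector-training pool.
rng = random.Random(0)
train_pool = [[rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)]
              for cx, cy in [(0, 0), (3, 3), (0, 3)] for _ in range(40)]

pseudo_labels = kmeans(train_pool, k=3)
W, b = train_selector(train_pool, pseudo_labels, k=3)

# The frozen selector is reused on a different pool without retraining.
unseen_pool = [[rng.gauss(1.5, 1.5), rng.gauss(1.5, 1.5)] for _ in range(100)]
chosen = select_least_confident(W, b, unseen_pool, ratio=0.15)
```

In this sketch the selector never sees the target model, which mirrors the decoupling the abstract describes: the only inputs are embeddings, so the same `W, b` can score any new pool. A real pipeline would replace the synthetic points with CLIP features and would likely use a different selector architecture and clustering setup than assumed here.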
Problem

Research questions and friction points this paper is trying to address.

multimodal instruction tuning
data selection
vision-language models
redundant data
model-agnostic selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

data selection
multimodal instruction tuning
transferable selector
vision-language models
training efficiency