Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language instruction tuning datasets often rely on linguistic patterns or commonsense shortcuts, hindering genuine cross-modal reasoning. To address this limitation, this work proposes CVS, a novel method that evaluates data quality based on the impact of a question on answer validity. Leveraging a frozen vision-language foundation model, CVS establishes a training-free assessment mechanism that quantifies the change in answer validity before and after introducing a question, thereby selecting high-quality samples requiring joint visual and linguistic reasoning. Experiments demonstrate that CVS achieves a 3.5%–4.8% performance gain over full-data training on Vision-Flan using only 10%–15% of the data, maintains robustness on Cauldron, and reduces computational costs by 17.3% and 44.4% compared to COINCIDE and XMAS, respectively.

📝 Abstract
Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or commonsense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS, respectively.
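The abstract's core idea, scoring each sample by how much the question shifts a frozen evaluator's assessment of answer validity, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `validity` callable is a hypothetical stand-in for a frozen-VLLM scoring interface, and the exact shift metric, prompt format, and noise-filtering step may differ from the paper's.

```python
from typing import Callable, List, Tuple

# Hypothetical sample layout: (image_id, question, answer).
Sample = Tuple[str, str, str]


def cvs_select(
    samples: List[Sample],
    validity: Callable[[str, str, str], float],  # assumed: P(answer valid | image, question)
    keep_frac: float = 0.10,
) -> List[Sample]:
    """Rank samples by the conditional validity shift and keep the top fraction.

    For each sample, query the frozen evaluator twice: once with the question
    and once without it (image + answer only). Samples whose validity rises
    most when the question is introduced are assumed to require joint
    vision-language reasoning, so they are retained.
    """
    scored = []
    for img, q, ans in samples:
        with_q = validity(img, q, ans)      # conditioned on the question
        without_q = validity(img, "", ans)  # question omitted
        scored.append((with_q - without_q, (img, q, ans)))
    # Highest shift first; keep the top keep_frac of the dataset.
    scored.sort(key=lambda t: t[0], reverse=True)
    k = max(1, int(len(scored) * keep_frac))
    return [s for _, s in scored[:k]]
```

In practice `validity` would wrap a frozen VLLM (e.g. the log-probability it assigns to the answer, or a yes/no validity judgment), so the whole procedure needs only forward passes and no proxy-model training.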
Problem

Research questions and friction points this paper is trying to address.

vision-language reasoning
data selection
instruction tuning
multimodal learning
cross-modal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free data selection
vision-language reasoning
conditional validity shift
instruction tuning
multimodal learning