ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency in large-scale vision-language instruction tuning caused by data redundancy. To this end, the authors propose a training-free multimodal data selection method that leverages the attention mechanism over instruction tokens in the target vision-language model to extract salient visual features and construct sample representations. Importance scores are then computed via projection onto a principal subspace, achieving linear time complexity. Notably, the approach requires neither external models nor auxiliary data, yet effectively captures instruction-relevant semantics. Experiments demonstrate that using only 16% of the original data, the method attains over 97.5% of the full-data performance—and in some cases even surpasses it—highlighting its efficacy in improving training efficiency without compromising model quality.

📝 Abstract
Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on such large-scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates multimodal data selection to improve training efficiency. Existing data selection methods for VIT require either costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity in the number of samples that eliminates the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting the visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset's representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at https://github.com/ChangtiWu/ScalSelect.
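The two-stage pipeline described in the abstract (instruction-attended visual feature pooling, then importance scoring by projection onto the dominant subspace) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the function names, the top-k attention pooling, and the use of a truncated SVD for the principal subspace are all illustrative choices.

```python
# Hedged sketch of a ScalSelect-style pipeline (assumed details, not the paper's code).
import numpy as np

def instruction_aware_representation(visual_feats, attn_weights, k=8):
    """Pool the k visual tokens most attended by instruction tokens.

    visual_feats : (num_visual_tokens, dim) visual features from the target VLM
    attn_weights : (num_instruction_tokens, num_visual_tokens) attention map
    """
    saliency = attn_weights.mean(axis=0)        # average attention per visual token
    top = np.argsort(saliency)[-k:]             # indices of most-attended tokens
    return visual_feats[top].mean(axis=0)       # one vector per sample

def subspace_scores(reps, rank=4):
    """Score each sample by its squared projection onto the dataset's
    top-`rank` principal directions (linear in the number of samples).

    reps : (num_samples, dim) stacked sample representations
    """
    centered = reps - reps.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                           # (rank, dim) principal directions
    proj = centered @ basis.T                   # (num_samples, rank)
    return (proj ** 2).sum(axis=1)              # one importance score per sample

# Toy usage: build representations for 100 synthetic samples, keep a 16% budget.
rng = np.random.default_rng(0)
reps = np.stack([
    instruction_aware_representation(rng.normal(size=(32, 16)),
                                     rng.random(size=(6, 32)))
    for _ in range(100)
])
scores = subspace_scores(reps)
budget = int(0.16 * len(reps))
selected = np.argsort(scores)[-budget:]         # highest-scoring samples
print(len(selected))
```

Note the design point this sketch reflects: scoring is a single matrix projection per sample, so the cost grows linearly with dataset size, unlike pairwise-similarity methods whose cost is quadratic in the number of samples.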
Problem

Research questions and friction points this paper is trying to address.

Visual Instruction Tuning
Multimodal Data Selection
Training-Free
Scalability
Data Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
multimodal data selection
visual instruction tuning
linear-time complexity
instruction-aware representation
Changti Wu
East China Normal University; Zhongguancun Academy
Jiahuai Mao
The Hong Kong Polytechnic University
Yuzhuo Miao
Zhongguancun Academy; Harbin Institute of Technology
Shijie Lian
Zhongguancun Academy; Huazhong University of Science and Technology
Bin Yu
Zhongguancun Academy; Harbin Institute of Technology
Xiaopeng Lin
The Hong Kong University of Science and Technology (Guangzhou); Zhongguancun Institute of Artificial Intelligence
Cong Huang
University of Science and Technology of China
Image/Video processing
Lei Zhang
East China Normal University
Information Security; Cryptography; VANET; Cloud Computing; Data Privacy
Kai Chen
Institute of Information Engineering, Chinese Academy of Sciences
Software analysis and testing; artificial intelligence; smartphones; privacy