Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories

📅 2025-10-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address high data redundancy and suboptimal subset selection in instruction tuning of large vision-language models (LVLMs), this paper proposes a data-filtering framework based on cross-modal attention alignment trajectories. The core contribution is the proof that examples with similar cross-modal attention matrices during instruction tuning have similar gradients and thus convey redundant information to the model, together with a clustering and balanced-sampling strategy guided by the trajectories of the top singular values of those attention matrices. Evaluated on LLaVA-665k and Vision-Flan, the method discards 50% and 85% of the data, respectively, while fully preserving the performance of LLaVA-1.5-7B across ten downstream tasks; training throughput increases by 1.2x, and data reduction is 30% greater than that of the strongest baseline.

📝 Abstract
Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language models (LLMs), it remains underexplored for Large Vision-Language Models (LVLMs). Notably, none of the existing methods can outperform random selection at different subset sizes. In this work, we propose the first principled method for data-efficient instruction tuning of LVLMs. We prove that examples with similar cross-modal attention matrices during instruction tuning have similar gradients. Thus, they influence model parameters in a similar manner and convey the same information to the model during training. Building on this insight, we propose XMAS, which clusters examples based on the trajectories of the top singular values of their attention matrices obtained from fine-tuning a small proxy LVLM. By sampling a balanced subset from these clusters, XMAS effectively removes redundancy in large-scale LVLM training data. Extensive experiments show that XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving the performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2x. This is 30% more data reduction compared to the best baseline for LLaVA-665k. The project's website can be found at https://bigml-cs-ucla.github.io/XMAS-project-page/.
Problem

Research questions and friction points this paper is trying to address.

Developing data selection methods for vision-language models to eliminate redundancy
Addressing inefficiency in large-scale training by selecting informative data subsets
Improving model performance and training speed through cross-modal alignment trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clusters examples using cross-modal attention trajectories
Selects balanced subset from clusters to remove redundancy
Achieves higher data reduction than existing baseline methods
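The selection pipeline described above can be sketched in a few steps: record cross-modal attention matrices for each example across proxy-model checkpoints, summarize each example by the trajectory of its top singular values, cluster the trajectories, and sample a balanced subset from the clusters. The sketch below is illustrative only, assuming synthetic inputs; the function names (`xmas_style_select`, `top_singular_values`, `_kmeans`), the cluster count, and the sampling budget are hypothetical choices, not the authors' released implementation.

```python
import numpy as np

def top_singular_values(attn, k=3):
    # Top-k singular values of one cross-modal attention matrix.
    return np.linalg.svd(attn, compute_uv=False)[:k]

def _kmeans(feats, k, iters=20, seed=0):
    # Minimal k-means on trajectory features; any clustering library would do.
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    labels = np.zeros(len(feats), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = feats[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels

def xmas_style_select(attn_trajectories, n_clusters=3, budget=6, seed=0):
    # attn_trajectories[i]: list of attention matrices for example i,
    # recorded at successive checkpoints of a small proxy model.
    feats = np.stack([
        np.concatenate([top_singular_values(a) for a in traj])
        for traj in attn_trajectories
    ])
    labels = _kmeans(feats, n_clusters, seed=seed)
    # Balanced sampling: roughly budget / n_clusters examples per cluster.
    rng = np.random.default_rng(seed)
    per_cluster = max(1, budget // n_clusters)
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if len(members):
            take = min(per_cluster, len(members))
            selected.extend(rng.choice(members, size=take, replace=False).tolist())
    return sorted(selected)
```

The balanced draw is what removes redundancy: heavily populated clusters (many examples with near-identical attention trajectories, hence near-identical gradients) contribute no more to the subset than sparse ones.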
Nilay Naharas
Department of Computer Science, University of California Los Angeles
Dang Nguyen
Department of Computer Science, University of California Los Angeles
Nesihan Bulut
Google Research
Mohammadhossein Bateni
Google Research
Vahab Mirrokni
Google Fellow, VP, Google Research
Algorithms, Market Design, GenAI Algorithms, ML Scalability, Graph Algorithms
Baharan Mirzasoleiman
UCLA
Machine Learning, Optimization, Submodularity, ML Sustainability, Data Quality