MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

📅 2026-05-25
🤖 AI Summary
This work addresses three limitations of existing large-scale vision-language instruction tuning datasets: high redundancy, weak visual dependency, and uneven coverage of multimodal reasoning skills. Because of these limitations, uniform subsampling and naive score-based selection often yield suboptimal training subsets. The authors propose MAGIC, a training-free, forward-only coreset selection method that combines three signals extracted from a pretrained VLM: multimodal gain, the sharpness of answer-token grounding over visual tokens, and skill-neuron activation signatures. A three-stage pipeline (filtering, ranking, and bucket-wise budget allocation) constructs compact yet behaviorally faithful subsets. Evaluated on LLaVA-665K and Vision-Flan with a 20% data budget, the method reaches 100.3% and 101.6% of full-dataset fine-tuning performance, respectively, while reducing wall-clock run time by 73.7%.
📝 Abstract
Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons. MAGIC combines these signals in a three-stage pipeline: filtering low-gain examples, ranking candidates by a normalized quality objective, and performing bucket-wise budget allocation over discrete neuron signatures to preserve latent multimodal skill coverage. This formulation avoids backpropagation, auxiliary selector training, and expensive clustering in continuous activation spaces, while remaining efficient and easily deployable in existing VLMs. Across the LLaVA-665K and Vision-Flan datasets, and in transfer settings to the large target models LLaVA-1.5-7B and LLaVA-1.5-13B, MAGIC consistently improves over strong baselines under matched 20% budgets: it achieves 100.3% of full fine-tuning performance on LLaVA-665K and 101.6% on Vision-Flan-186K, while yielding a 73.7% reduction in wall-clock run time.
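The three-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the three signals have already been computed per sample by forward passes through a pretrained VLM, and every name in it (`multimodal_gain`, `neuron_signature`, `select_coreset`, `gain_threshold`, `alpha`) is hypothetical. The abstract does not give the exact form of the normalized quality objective or the allocation rule, so the sketch substitutes a min-max-normalized weighted sum and a round-robin draw across signature buckets.

```python
from collections import defaultdict

import numpy as np


def multimodal_gain(logp_with_image, logp_text_only):
    """Assumed form: answer log-likelihood with the image minus without it."""
    return np.asarray(logp_with_image, dtype=float) - np.asarray(logp_text_only, dtype=float)


def neuron_signature(activations, k=2):
    """Assumed form: sorted indices of the k most activated feed-forward neurons."""
    top = np.argsort(np.asarray(activations, dtype=float))[-k:]
    return tuple(sorted(int(i) for i in top))


def _minmax(x):
    """Scale values to [0, 1]; constant inputs map to all zeros."""
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span


def select_coreset(gain, relevance, signatures, budget, gain_threshold=0.0, alpha=0.5):
    """Return indices of selected samples (filter, rank, then bucketed allocation)."""
    gain = np.asarray(gain, dtype=float)
    relevance = np.asarray(relevance, dtype=float)

    # Stage 1: filter out samples whose gain from the image is too low.
    keep = np.flatnonzero(gain > gain_threshold)
    if keep.size == 0:
        return []

    # Stage 2: score survivors with a normalized quality objective
    # (assumption: weighted sum of min-max-normalized gain and relevance).
    quality = alpha * _minmax(gain[keep]) + (1 - alpha) * _minmax(relevance[keep])

    # Stage 3: group by discrete neuron signature, best sample first per bucket.
    buckets = defaultdict(list)
    for idx, q in zip(keep, quality):
        buckets[signatures[idx]].append((float(q), int(idx)))
    for items in buckets.values():
        items.sort(key=lambda t: (-t[0], t[1]))

    # Assumption: round-robin across buckets so every signature is covered
    # before any bucket contributes a second sample.
    budget = min(budget, int(keep.size))
    order = sorted(buckets)
    selected, rank = [], 0
    while len(selected) < budget:
        for sig in order:
            if rank < len(buckets[sig]) and len(selected) < budget:
                selected.append(buckets[sig][rank][1])
        rank += 1
    return selected


# Toy usage with made-up numbers: six samples, three signature buckets.
toy_gain = multimodal_gain([-1.1, -2.2, -1.5, -1.3, -1.9, -1.2],
                           [-2.0, -2.0, -2.0, -2.0, -2.0, -2.0])
toy_relevance = [0.2, 0.9, 0.6, 0.4, 0.3, 0.9]
toy_signatures = [(1, 2), (1, 2), (3, 4), (1, 2), (3, 4), (5, 6)]
print(select_coreset(toy_gain, toy_relevance, toy_signatures, budget=3))
```

In this toy run, the second sample has negative gain and is dropped in the filtering stage regardless of its high relevance score, and the budget of three is spread over the three distinct signatures rather than spent on the top-scoring samples alone. Grouping by discrete signatures, rather than clustering continuous activations, is what keeps this last stage a simple dictionary lookup.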
Problem

Research questions and friction points this paper is trying to address.

instruction tuning
multimodal redundancy
visual dependency
reasoning behavior imbalance
coreset selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Gain
Bridging Relevance
Skill-Neuron Signatures
Coreset Selection
Instruction Tuning