🤖 AI Summary
In continual visual instruction tuning (CVIT), streaming multimodal data induces significant training latency, and existing reference-free online sample selection methods, constrained by fixed sampling budgets (e.g., top-k), fail to adapt to inter-batch variations in informativeness and to distribution shifts.
Method: We propose a dynamic inter-batch informativeness-aware mechanism and an iterative redundancy-aware scoring update strategy. Our approach employs relative information gain for adaptive sampling and gradient-sensitivity-based reweighting to perform reference-free online importance estimation.
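The adaptive-sampling idea above can be sketched in a few lines: scale each batch's sampling budget by its informativeness relative to a running average over past batches. This is a minimal illustration; the function name, the running-mean statistic, and the proportional scaling rule are assumptions for exposition, not the paper's exact formulation.

```python
def adaptive_batch_budget(batch_scores, running_mean, base_budget, batch_size=32):
    """Scale the per-batch sampling budget by relative information gain.

    batch_scores : per-sample importance scores for the current batch
    running_mean : running average of batch-level informativeness so far
    base_budget  : nominal number of samples to keep per batch (e.g., top-k)
    """
    # Batch-level informativeness: mean of per-sample scores (illustrative choice).
    batch_info = sum(batch_scores) / len(batch_scores)
    # Relative information gain: how informative is this batch vs. history?
    ratio = batch_info / max(running_mean, 1e-8)
    # Grow or shrink the budget proportionally, clamped to [1, batch_size].
    k = round(base_budget * ratio)
    return max(1, min(int(k), batch_size))
```

A batch twice as informative as the running average doubles its budget; a half-as-informative batch halves it, instead of always keeping a fixed top-k.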
Contribution/Results: By removing the fixed-sampling constraint, our method substantially improves selection robustness under distribution shift. Evaluated on mainstream multimodal large language models (MLLMs), including LLaVA-1.5 and Qwen-VL-2.5, it achieves performance comparable to full-data training using only 25% of the training samples and outperforms existing state-of-the-art methods across benchmarks.
📝 Abstract
In continual visual instruction tuning (CVIT) scenarios, where multimodal data continuously arrive in an online streaming manner, training delays from large-scale data significantly hinder real-time adaptation. While existing data selection strategies reduce training overhead, they rely on pre-trained reference models, which are impractical in CVIT setups because future data are unknown. Recent reference-model-free online sample selection methods address this issue but typically select a fixed number of samples per batch (e.g., top-k), making them brittle under distribution shifts, since informativeness varies across batches. To address these limitations, we propose OASIS, an adaptive online sample selection approach for CVIT that: (1) dynamically adjusts the number of samples selected per batch based on relative inter-batch informativeness, and (2) minimizes redundancy among selected samples through iterative selection score updates. Empirical results across various MLLMs, such as LLaVA-1.5 and Qwen-VL-2.5, show that OASIS achieves performance comparable to full-data training using only 25% of the data and outperforms the state-of-the-art.
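The redundancy-minimizing score update in point (2) can be sketched as a greedy loop: pick the highest-scoring sample, then down-weight the scores of remaining samples in proportion to their similarity to it. The cosine-similarity penalty below is a common stand-in for such iterative updates and is an assumption here, not OASIS's exact update rule.

```python
import numpy as np

def select_with_redundancy_update(features, scores, k):
    """Greedily select k samples, iteratively down-weighting scores of
    samples similar to those already chosen (illustrative sketch)."""
    feats = np.asarray(features, dtype=float)
    # L2-normalize so dot products are cosine similarities.
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    s = np.asarray(scores, dtype=float).copy()
    selected = []
    for _ in range(k):
        i = int(np.argmax(s))
        selected.append(i)
        # Penalize remaining samples by similarity to the one just picked.
        sim = feats @ feats[i]
        s = s * (1.0 - np.clip(sim, 0.0, 1.0))
        s[i] = -np.inf  # never pick the same sample twice
    return selected
```

With two near-duplicate high-scoring samples and one distinct lower-scoring sample, the update suppresses the duplicate, so the selection covers both modes rather than wasting budget on redundant data.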