🤖 AI Summary
This work addresses the trade-off between learning visual concepts and acquiring visual skills in vision-language instruction tuning. It proposes an instruction filtering framework driven by task characteristics: the authors analyze more than ten mainstream multimodal benchmarks to determine whether each relies mainly on concept-oriented or skill-oriented instructions. Each benchmark is then represented along two dimensions, visual concepts and visual skills, and training instructions are selected according to how well they match the benchmark on its dominant dimension. The analysis shows that benchmarks diverge markedly in whether they benefit from similar concepts or from similar skills. Using this classification to filter instructions yields an average improvement of 0.9% over the strongest baseline across all benchmarks, and 1.5% on the skill-focused subset.
📝 Abstract
Vision-language instruction tuning serves two main purposes: learning visual concepts and learning visual skills. In this paper, we find that vision-language benchmarks fall into a dichotomy: each benefits mainly from training on instructions with either similar skills or similar visual concepts. Inspired by this finding, we design a simple targeted training data selection method to optimize performance on a given benchmark. We first extract the concepts and skills from the benchmark, then determine whether the benchmark predominantly benefits from similar concepts or similar skills, and finally select the instructions whose concepts or skills match best. Experiments on 10+ benchmarks validate the effectiveness of our targeted data selection method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Our findings underscore the importance of recognizing the inherent trade-off in instruction selection, which requires balancing the acquisition of conceptual knowledge against that of visual skills.
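
The three-step selection procedure described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: concepts and skills are represented as plain tag sets, matching is scored with Jaccard overlap, the concept-versus-skill decision is passed in as a `mode` argument rather than estimated, and the names `jaccard` and `select_instructions` are hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two tag sets; returns 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_instructions(
    benchmark: Dict[str, Set[str]],
    pool: List[dict],
    mode: str,
    k: int,
) -> List[str]:
    """Return ids of the k pool instructions that best match the benchmark.

    `mode` is "concepts" or "skills" and stands in for the step that decides
    which dimension the benchmark predominantly benefits from (assumed to be
    known here; the abstract does not say how it is determined).
    """
    if mode not in ("concepts", "skills"):
        raise ValueError("mode must be 'concepts' or 'skills'")
    target = benchmark[mode]
    # Rank by overlap on the chosen dimension only; sorted() is stable,
    # so ties keep their original pool order.
    ranked = sorted(pool, key=lambda ex: jaccard(ex[mode], target), reverse=True)
    return [ex["id"] for ex in ranked[:k]]


# Toy data (invented for illustration).
benchmark = {"concepts": {"chart", "axis"}, "skills": {"counting", "ocr"}}
pool = [
    {"id": "a", "concepts": {"chart", "axis"}, "skills": {"captioning"}},
    {"id": "b", "concepts": {"dog"}, "skills": {"counting", "ocr"}},
    {"id": "c", "concepts": {"chart"}, "skills": {"counting"}},
]

by_skill = select_instructions(benchmark, pool, "skills", 2)
by_concept = select_instructions(benchmark, pool, "concepts", 2)
print(by_skill, by_concept)
```

On this toy pool the two modes pick different instructions, which is the point of the dichotomy: the same benchmark description leads to a different training subset depending on whether concepts or skills are treated as the dimension that matters.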