Concepts or Skills? Rethinking Instruction Selection for Multi-modal Models

📅 2025-08-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the trade-off between learning visual concepts and acquiring visual skills in vision-language instruction tuning. We analyze whether each of more than ten mainstream multimodal benchmarks benefits mainly from concept-matched or skill-matched training instructions, and find that benchmarks diverge clearly in this preference. Based on this analysis, we represent each benchmark along two dimensions (visual concepts and visual skills), classify it as concept-focused or skill-focused, and select the training instructions whose concepts or skills match it most closely. This targeted selection improves on the strongest baseline by 0.9% on average across the full benchmark suite and by 1.5% on the skill-focused subset.

📝 Abstract

Vision-language instruction tuning serves two main purposes: learning visual concepts and learning visual skills. In this paper, we find that vision-language benchmarks fall into two groups: those that benefit mainly from training on instructions with similar skills, and those that benefit mainly from training on instructions with similar visual concepts. Inspired by this finding, we design a simple targeted training data selection method to optimize performance on a given benchmark. We first extract the concepts and skills from the benchmark, then determine whether the benchmark predominantly benefits from similar concepts or similar skills, and finally select the instructions whose concepts or skills match most closely. Experiments on 10+ benchmarks validate the effectiveness of our targeted data selection method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Our findings underscore the importance of recognizing the inherent trade-off in instruction selection, which requires balancing the acquisition of conceptual knowledge against the acquisition of visual skills.
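
The final selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes concepts and skills have already been extracted as plain string tags, and it uses a simple coverage ratio as a stand-in for whatever matching measure the paper uses. The names `overlap_score` and `select_instructions`, the dictionary layout, and the tags are all hypothetical.

```python
from typing import Dict, List, Set


def overlap_score(instr_tags: Set[str], bench_tags: Set[str]) -> float:
    """Fraction of the benchmark's tags that the instruction covers.

    Assumption: a coverage ratio stands in for the paper's matching measure.
    """
    if not bench_tags:
        return 0.0
    return len(instr_tags & bench_tags) / len(bench_tags)


def select_instructions(
    instructions: List[Dict],
    benchmark: Dict[str, Set[str]],
    mode: str,
    k: int,
) -> List[Dict]:
    """Return the k instructions that best match the benchmark.

    `mode` is "concepts" or "skills" and says which tag set to compare,
    i.e. which kind of similarity the benchmark is assumed to benefit from.
    """
    if mode not in ("concepts", "skills"):
        raise ValueError(f"unknown mode: {mode!r}")
    bench_tags = benchmark[mode]
    # sorted() is stable, so tied instructions keep their original order.
    ranked = sorted(
        instructions,
        key=lambda ins: overlap_score(ins[mode], bench_tags),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy data, for illustration only.
instructions = [
    {"id": "a", "concepts": {"dog", "park"}, "skills": {"captioning"}},
    {"id": "b", "concepts": {"car"}, "skills": {"counting"}},
    {"id": "c", "concepts": {"dog", "frisbee"}, "skills": {"ocr"}},
]
benchmark = {"concepts": {"dog", "frisbee"}, "skills": {"counting"}}

by_concept = select_instructions(instructions, benchmark, "concepts", k=2)
by_skill = select_instructions(instructions, benchmark, "skills", k=1)
```

With this toy data, selecting by concepts ranks instruction `c` ahead of `a`, while selecting by skills picks `b`, which shows how the chosen mode changes which training instructions are kept.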

Problem

Research questions and friction points this paper is trying to address.

Determine if benchmarks benefit from similar skills or concepts

Optimize performance via targeted training data selection

Balance conceptual knowledge acquisition with visual skills
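
One hypothetical way to make the first question concrete is to compare pilot results from concept-matched and skill-matched training subsets and label each benchmark by whichever helps more. The page does not describe how the paper makes this decision, so the sketch below is an assumption throughout: the function name `choose_mode`, the tie-breaking rule, the benchmark names, and the scores are invented for illustration and are not results from the paper.

```python
from typing import Dict, Tuple


def choose_mode(pilot_scores: Dict[str, Tuple[float, float]]) -> Dict[str, str]:
    """Label each benchmark as concept-focused or skill-focused.

    pilot_scores maps a benchmark name to a pair
    (score with concept-matched data, score with skill-matched data).
    Assumption: ties default to "skills"; this choice is arbitrary.
    """
    modes = {}
    for name, (concept_score, skill_score) in pilot_scores.items():
        modes[name] = "concepts" if concept_score > skill_score else "skills"
    return modes


# Made-up pilot scores for two hypothetical benchmarks.
pilot_scores = {
    "bench_A": (61.2, 58.9),
    "bench_B": (44.0, 47.5),
}
modes = choose_mode(pilot_scores)
```

The resulting label per benchmark would then decide which kind of tag similarity drives instruction selection for that benchmark.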

Innovation

Methods, ideas, or system contributions that make the work stand out.

Targeted training data selection method

Extract concepts/skills from benchmarks

Select instructions with matching concepts/skills

🔎 Similar Papers

Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review