COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

📅 2025-04-30
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit limited performance on composite vision-language tasks requiring joint object recognition, counting, and spatial reasoning, primarily because conventional visual instruction tuning (VIT) emphasizes data-scale expansion while neglecting the combinatorial complexity of capability integration. Method: We propose a novel VIT paradigm—“combinatorial capability complexity control”—featuring (i) an atomic capability decoupling annotation schema, (ii) a capability composition graph to guide progressive synthetic data generation, and (iii) a lightweight fine-tuning framework. Contribution/Results: Using less than 10% of the LLaVA-665k dataset, our method achieves 83.3% and 94.0% relative improvements on the MMStar and MM-Vet benchmarks for questions demanding ≥4 atomic capabilities, significantly outperforming data-driven baselines. This work pioneers explicit structural modeling of capability composition in VIT, offering a principled and efficient pathway to unlock MLLMs’ compositional reasoning abilities.
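The progressive atomic-to-complex data generation described above can be illustrated with a minimal sketch. The capability names, the generator stub, and all function names below are hypothetical stand-ins, not the paper's actual schema or pipeline; the sketch only shows the core idea of enumerating capability combinations of controlled complexity and turning each combination into a synthetic training prompt.

```python
import itertools

# Hypothetical atomic capabilities; the paper's exact taxonomy may differ.
ATOMIC_CAPABILITIES = [
    "object_recognition",
    "counting",
    "spatial_reasoning",
    "attribute_recognition",
    "text_reading",
]

def compose_capability_sets(max_complexity: int):
    """Enumerate capability combinations from atomic (k=1) up to
    complex (k=max_complexity), mimicking progressive composition."""
    for k in range(1, max_complexity + 1):
        for combo in itertools.combinations(ATOMIC_CAPABILITIES, k):
            yield combo

def make_training_item(image_id: str, capabilities: tuple) -> dict:
    """Stub for prompting a generator model to write one question that
    jointly exercises every capability in `capabilities`."""
    prompt = (
        f"Write one question about image {image_id} that requires: "
        + ", ".join(capabilities)
    )
    return {
        "image": image_id,
        "capabilities": list(capabilities),
        "prompt": prompt,
    }

# Build a small, complexity-controlled dataset sketch (k = 1..3).
dataset = [
    make_training_item("img_001", combo)
    for combo in compose_capability_sets(max_complexity=3)
]
```

In a real pipeline, the prompt would be sent to a generator model and the resulting question-answer pairs used for fine-tuning; the point here is that dataset composition is driven by the capability-combination structure rather than raw data volume.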

📝 Abstract
Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This may stem partly from the fact that Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume rather than the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset that explicitly controls the compositional complexity of its examples. COMPACT's data lets MLLMs train on combinations of atomic capabilities and thereby learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves performance comparable to LLaVA-665k VIT while using less than 10% of its data budget, and it outperforms the full-scale recipe on several benchmarks, especially those involving complex multi-capability tasks. For example, COMPACT achieves a substantial 83.3% improvement on MMStar and a 94.0% improvement on MM-Vet over the full-scale VIT on particularly complex questions that require four or more atomic capabilities. COMPACT thus offers a scalable, data-efficient, compositional visual tuning recipe for improving performance on complex vision-language tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing MLLMs' ability to handle complex multi-capability visual-language tasks
Addressing limitations of traditional Visual Instruction Tuning by controlling compositional complexity
Improving data efficiency in training for complex visual-language tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates a training dataset that explicitly controls compositional complexity
Trains MLLMs on combinations of atomic capabilities to build complex ones
Matches full LLaVA-665k VIT performance using less than 10% of its data