🤖 AI Summary
This work identifies a systematic deficiency in multimodal large language models (MLLMs): their inability to effectively compose cross-modal skills, that is, to sequentially coordinate vision and language capabilities for composite tasks with structured multimodal skill dependencies.
Method: The authors introduce three benchmark task categories with explicit skill dependency graphs, compare direct prompting against two-stage cascaded inference, and propose a hybrid approach combining chain-of-thought prompting with lightweight, skill-composition-oriented fine-tuning.
Contribution/Results: Experiments show that while the proposed method significantly improves performance, all mainstream MLLMs exhibit a persistent and non-negligible compositional generalization gap. This study is the first to systematically characterize the cross-modal skill composition bottleneck in MLLMs, establishing a novel, structured evaluation benchmark and a reproducible methodology for assessing and enhancing the compositional reasoning capabilities of multimodal models.
📝 Abstract
Skill composition is the ability to combine previously learned skills to solve new tasks. As neural networks acquire increasingly complex skills during pretraining, it is not clear how successfully they can compose them. In this paper, we focus on Multimodal Large Language Models (MLLMs) and study their ability to compose skills across modalities. To this end, we design three evaluation tasks that can be solved by sequentially composing two modality-dependent skills, and evaluate several open MLLMs under two main settings: i) prompting the model to directly solve the task, and ii) using a two-step cascaded inference approach, which manually enforces the composition of the two skills for a given task. Even with these straightforward compositions, we find that all evaluated MLLMs exhibit a significant cross-modality skill composition gap. To mitigate this gap, we explore two alternatives: i) chain-of-thought prompting that explicitly instructs MLLMs to compose skills, and ii) a fine-tuning recipe designed to promote skill composition. Although these strategies improve model performance, significant skill composition gaps remain, suggesting that more research is needed to improve cross-modal skill composition in MLLMs.
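To make the two evaluation settings concrete, here is a minimal sketch of direct versus cascaded inference. The `mllm_generate` function, the prompts, and the image path are all hypothetical placeholders, not the paper's actual implementation or any real model API; the stub merely illustrates the control flow of each setting.

```python
def mllm_generate(image, prompt):
    """Stub standing in for an MLLM call; returns a text answer.

    A real implementation would invoke an actual multimodal model.
    Here we hard-code a toy example: reading a handwritten sum
    (skill A, vision) and then computing it (skill B, language).
    """
    if "transcribe" in prompt.lower():
        return "2 + 3"   # skill A: visual text recognition
    return "5"           # skill B: arithmetic over text

def direct_inference(image, task_prompt):
    # Setting i): ask the model to solve the composite task in one shot,
    # leaving the skill composition implicit.
    return mllm_generate(image, task_prompt)

def cascaded_inference(image, extract_prompt, solve_prompt_template):
    # Setting ii): manually enforce the composition by splitting the task
    # into two calls and handing the intermediate output between them.
    intermediate = mllm_generate(image, extract_prompt)        # skill A
    solve_prompt = solve_prompt_template.format(intermediate)  # hand-off
    return mllm_generate(None, solve_prompt)                   # skill B

image = "handwritten_sum.png"  # placeholder path
direct = direct_inference(image, "What is the result of the sum in the image?")
cascaded = cascaded_inference(
    image,
    "Transcribe the expression in the image.",
    "Compute: {}",
)
```

The gap the paper measures is the performance difference between these two settings: a model that possesses both skills individually should, in principle, score as well under direct inference as under the manually enforced cascade.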