CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs

📅 2025-10-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing knowledge distillation methods struggle to transfer the visual perception capabilities of multimodal large language models (MLLMs) from teacher to student, in part because they overlook the misalignment between the teacher's and the student's visual attention. To address this, we propose CompoDistill, a novel framework that is the first to systematically identify and resolve visual attention mismatch in MLLM distillation. It introduces an explicit attention alignment mechanism tailored to multimodal scenarios and combines feature-level alignment with multi-task collaborative training. Our approach significantly enhances the student model's fine-grained visual understanding on compositional reasoning tasks, achieving state-of-the-art performance across multiple benchmarks. It also preserves strong performance on conventional visual question answering (VQA) tasks and generalizes to more advanced vision backbones.

📝 Abstract

Recently, efficient Multimodal Large Language Models (MLLMs) have gained significant attention as a solution to the high computational complexity of MLLMs, making them more practical for real-world applications. In this regard, knowledge distillation (KD) has emerged as a promising approach, transferring the rich visual and linguistic knowledge of a larger model (teacher) to a smaller model (student). However, we observe that existing KD methods struggle to distill the teacher MLLM's rich visual perception abilities to the student, a challenge that has been largely overlooked in previous studies. Through a systematic analysis, we identify visual attention misalignment between student and teacher as the main cause of this issue. Based on this insight, we propose CompoDistill, a novel KD framework that explicitly aligns the student's visual attention with that of the teacher to enhance the student's visual perception abilities. Our extensive experiments show that CompoDistill significantly improves performance on compositional reasoning tasks that require visual perception abilities while maintaining strong performance on visual question answering tasks, as existing studies do. Furthermore, CompoDistill remains effective with a more advanced backbone, highlighting its generalizability.
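
The abstract does not state the exact loss CompoDistill uses, so the following is only a minimal sketch of what "aligning the student's visual attention with that of the teacher" could look like. It assumes a KL divergence between teacher and student attention, restricted to visual-token positions and renormalized. The function names, the choice of KL, and the assumption that teacher and student attention rows are already matched one-to-one (same queries, same token positions) are all illustrative and not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def visual_attention(attn_row, visual_idx):
    """Keep only the attention mass on visual tokens and renormalize it."""
    picked = [attn_row[i] for i in visual_idx]
    total = sum(picked)
    return [p / total for p in picked]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_alignment_loss(teacher_attn, student_attn, visual_idx):
    """Mean KL(teacher || student) over query rows, on visual tokens only.

    Hypothetical loss: the paper's actual objective may differ.
    """
    losses = []
    for t_row, s_row in zip(teacher_attn, student_attn):
        t = visual_attention(t_row, visual_idx)
        s = visual_attention(s_row, visual_idx)
        losses.append(kl_divergence(t, s))
    return sum(losses) / len(losses)


# Toy example: one query token attending over 3 visual tokens and 1 text token.
visual_idx = [0, 1, 2]
teacher = [softmax([2.0, 0.5, 0.1, 0.1])]  # teacher focuses on visual token 0
student = [softmax([0.1, 0.5, 2.0, 0.1])]  # student focuses on visual token 2

loss_mismatch = attention_alignment_loss(teacher, student, visual_idx)
loss_match = attention_alignment_loss(teacher, teacher, visual_idx)
print(loss_mismatch, loss_match)
```

In a real training setup, a term like this would typically be added to the usual distillation and task losses with a weighting coefficient, and the attention maps would come from selected layers and heads of the two models. Those choices are left open here because the source does not specify them.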

Problem

Research questions and friction points this paper is trying to address.

Distilling visual perception abilities in multimodal LLMs

Addressing visual attention misalignment in knowledge distillation

Enhancing compositional reasoning through attention alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns student-teacher visual attention via distillation

Enhances compositional reasoning in multimodal models

Improves visual perception while maintaining VQA performance
