HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models

📅 2025-06-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the weak visual understanding of vision-language models (VLMs) and the high computational cost of multi-expert knowledge distillation, this paper proposes a hierarchical knowledge distillation framework. Methodologically, it (1) introduces teacher-specific LoRA adapters and a learnable router to enable conflict-free, adaptive fusion of multi-expert features, and (2) establishes a dual-level distillation mechanism, fine-grained (token-level importance weighting) and coarse-grained (global semantic summarization), to dynamically emphasize discriminative visual information. Evaluated on multiple VLM benchmarks, the method significantly outperforms mainstream open-source models with only a 0.3% increase in inference parameter count and a negligible change in FLOPs, achieving both high performance and efficiency. The core contribution is the first joint modeling of routing-based multi-expert distillation and hierarchical knowledge transfer, establishing a novel paradigm for lightweight VLM visual enhancement.

๐Ÿ“ Abstract
Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and to switch between different teachers' knowledge, instead of using a fixed set of adapters shared across teachers, we propose teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to adaptively emphasize the most informative tokens from each teacher. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII compared to popular open-source VLMs.
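The teacher-specific LoRA adapters with a router described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's released code: the dimensions, zero-initialization of the B matrices, and the softmax router over per-token logits are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, num_teachers, num_tokens = 64, 4, 3, 16  # hypothetical sizes, rank r << d

# One low-rank (A, B) pair per teacher: delta_W = B @ A.
# B starts at zero, so each adapter initially contributes nothing (standard LoRA init).
adapters = [
    (rng.normal(0, 0.02, (r, d)), np.zeros((d, r)))
    for _ in range(num_teachers)
]
router_W = rng.normal(0, 0.02, (num_teachers, d))  # learnable router weights (assumed linear)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route_and_adapt(tokens):
    """Fuse teacher-specific LoRA updates with per-token router weights."""
    gates = softmax(tokens @ router_W.T)      # (num_tokens, num_teachers)
    out = tokens.copy()
    for t, (A, B) in enumerate(adapters):
        delta = (tokens @ A.T) @ B.T          # low-rank update for teacher t
        out += gates[:, t:t+1] * delta        # gated fusion keeps teachers separated
    return out

tokens = rng.normal(size=(num_tokens, d))
out = route_and_adapt(tokens)
print(out.shape)  # (16, 64)
```

Keeping a separate (A, B) pair per teacher is what lets the router switch between teachers without their gradients interfering in a single shared adapter, which is the conflict-avoidance argument the abstract makes.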
Problem

Research questions and friction points this paper is trying to address.

Reduce computational costs in vision-language models
Mitigate conflicts among multiple visual experts
Enable efficient knowledge distillation from experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical knowledge transfer from multiple experts
Teacher-specific LoRA adapters with dynamic routing
Fine-grained and coarse-grained token distillation
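The dual-level distillation objective listed above can be illustrated with a small NumPy sketch. The importance scoring here (a softmax over teacher-token norms) and the mean-pooled global summary are stand-in assumptions; the paper's exact scoring and summarization functions are not specified in this summary.

```python
import numpy as np

rng = np.random.default_rng(1)
num_tokens, d = 16, 64  # hypothetical sizes

student = rng.normal(size=(num_tokens, d))  # student vision-encoder features
teacher = rng.normal(size=(num_tokens, d))  # one teacher expert's features

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Fine-grained: weight each token's alignment error by a teacher-derived
# importance score (norm-based proxy here, assumed for illustration).
importance = softmax(np.linalg.norm(teacher, axis=1))           # (num_tokens,)
fine_loss = np.sum(importance * np.sum((student - teacher) ** 2, axis=1))

# Coarse-grained: align global summaries (mean pooling as a simple summarizer).
coarse_loss = np.sum((student.mean(axis=0) - teacher.mean(axis=0)) ** 2)

loss = fine_loss + coarse_loss
```

In the full framework this loss would be summed over teachers, with the coarse-grained path routed through the general-knowledge LoRA adapters rather than computed directly on raw features.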