AI Summary
To address the weak visual understanding of vision-language models (VLMs) and the high computational cost of multi-expert knowledge distillation, this paper proposes a hierarchical knowledge distillation framework. Methodologically: (1) it introduces teacher-specific LoRA adapters and a learnable router to enable conflict-free, adaptive fusion of multi-expert features; (2) it establishes a dual-level distillation mechanism, fine-grained (token-level importance weighting) and coarse-grained (global semantic summarization), to dynamically emphasize discriminative visual information. Evaluated on multiple VLM benchmarks, the method significantly outperforms mainstream open-source models with only a 0.3% increase in inference parameter count and a negligible change in FLOPs, achieving both high performance and efficiency. The core contribution is the first joint modeling of routing-based multi-expert distillation and hierarchical knowledge transfer, establishing a new paradigm for lightweight visual enhancement of VLMs.
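The routing idea in (1) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions, the zero-initialized LoRA `B` matrices, and the linear token-wise router (`W_router`) are all illustrative assumptions; it only shows the mechanism of a frozen base projection plus router-weighted, teacher-specific low-rank updates.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_teachers, n_tokens = 8, 2, 3, 5  # hypothetical sizes

# Frozen base projection of the student vision encoder (stand-in).
W = rng.standard_normal((d, d)) * 0.1

# One low-rank (A, B) pair per teacher: delta_t(x) = x @ A_t^T @ B_t^T.
A = rng.standard_normal((n_teachers, r, d)) * 0.1
B = np.zeros((n_teachers, d, r))  # LoRA convention: B starts at zero

# Learnable router producing per-token, per-teacher mixture weights.
W_router = rng.standard_normal((n_teachers, d)) * 0.1

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def forward(x):
    """x: (n_tokens, d) token features -> adapted features (n_tokens, d)."""
    base = x @ W.T
    gates = softmax(x @ W_router.T)      # (n_tokens, n_teachers)
    out = base.copy()
    for t in range(n_teachers):
        delta = x @ A[t].T @ B[t].T      # teacher-t low-rank update
        out += gates[:, t:t + 1] * delta # router-weighted fusion
    return out

x = rng.standard_normal((n_tokens, d))
y = forward(x)
print(y.shape)  # (5, 8)
```

Because each `B[t]` is zero-initialized, the adapted output initially equals the frozen base projection, so adding adapters does not perturb the pretrained encoder before distillation begins.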
Abstract
Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and switch among teacher-specific knowledge, instead of using a fixed set of adapters shared across teachers, we propose teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to adaptively emphasize the most informative tokens from each teacher. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII compared to popular open-source VLMs.
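The fine-grained level can be sketched as an importance-weighted feature-matching loss. This is a hedged illustration, not the paper's exact objective: the random importance logits, the softmax normalization, and the token-wise MSE are all assumptions standing in for however the token importance scores and distance are actually computed.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d = 6, 4

student = rng.standard_normal((n_tokens, d))  # student token features
teacher = rng.standard_normal((n_tokens, d))  # one teacher's token features

# Hypothetical per-token importance logits (e.g. derived from the teacher);
# random here purely for illustration.
logits = rng.standard_normal(n_tokens)
w = np.exp(logits - logits.max())
w = w / w.sum()  # softmax -> weights summing to 1

per_token = ((student - teacher) ** 2).mean(axis=1)  # token-wise MSE
loss = float((w * per_token).sum())                  # importance-weighted sum
print(loss)
```

Since the weights form a convex combination, the loss lies between the smallest and largest per-token errors, so tokens the teacher deems informative dominate the gradient while uninformative ones are down-weighted rather than discarded.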