🤖 AI Summary
This work addresses the degradation of language capabilities in pretrained language models when adapted to multimodal settings—a phenomenon often caused by representation shift and cross-modal interference and typically resistant to recovery via standard fine-tuning. The authors propose a novel adapter-free distillation approach that freezes the original language model as a teacher and enables it to perceive the student’s multimodal representations through shared key-value (KV) caches across layers. By selectively distilling strong linguistic signals from language-intensive data while preserving visual task performance, the method restores the native language proficiency of vision-language models without introducing additional modules. This strategy avoids architectural complexity and inference overhead, recovers approximately 10% of lost performance on language and knowledge benchmarks, and maintains stable accuracy on visual tasks.
📝 Abstract
Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by using the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers $\sim$10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.
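To make the two ideas in the abstract concrete, here is a minimal, illustrative sketch of (a) a frozen teacher attending over a KV cache produced by the student's multimodal layer, and (b) a KL-based distillation loss on language data. This is an assumption-laden toy: single-head attention, random toy tensors, and function names (`attention`, `kl_divergence`) are our own, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention (single head, no mask).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

def kl_divergence(p_logits, q_logits):
    # KL(teacher || student) over next-token distributions.
    p, q = softmax(p_logits), softmax(q_logits)
    return float((p * (np.log(p) - np.log(q))).sum())

rng = np.random.default_rng(0)
d_model = 8

# KV-cache sharing: the student (VLM) layer produces keys/values over a
# multimodal sequence (image patches + text tokens). The frozen teacher
# keeps its own queries but attends over the student's shared KV cache,
# so it "perceives" the multimodal representations without any adapter.
student_k = rng.normal(size=(12, d_model))   # 12 multimodal positions
student_v = rng.normal(size=(12, d_model))
teacher_q = rng.normal(size=(4, d_model))    # 4 text positions
teacher_out = attention(teacher_q, student_k, student_v)

# Selective distillation: on language-intensive data, pull the student's
# output distribution toward the frozen teacher's (toy logits here).
teacher_logits = rng.normal(size=10)
student_logits = rng.normal(size=10)
loss = kl_divergence(teacher_logits, student_logits)
print(teacher_out.shape, loss >= 0.0)
```

In a real transformer this sharing would happen per layer (hence "layer-wise"), and the loss would be applied only on language-heavy batches so that visual grounding on multimodal batches is left untouched.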