Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing knowledge distillation methods for compressing multimodal large language models: they rely solely on static next-token alignment and overlook the dynamic token interactions essential for multimodal understanding and generation. To this end, the authors propose Align-TI, a distillation framework that introduces token interaction mechanisms into multimodal distillation for the first time. Align-TI incorporates two key modules, Vision-Instruction Alignment (IVA) and Token-wise Progression Alignment (TPA), which explicitly model visual information extraction and generative reasoning dynamics, respectively, enabling fine-grained knowledge transfer. Experimental results show that Align-TI achieves a 2.6% relative improvement over vanilla KD and yields a distilled 2B model that surpasses the much larger LLaVA-1.5-7B by 7.0%, establishing state-of-the-art distillation performance.
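The summary describes IVA as aligning the student on the teacher's salient visual regions. As a hedged illustration only (the paper's actual loss is not given here; the function name, the attention-distribution inputs, and the row-wise KL formulation are all assumptions), such an alignment could be sketched as matching the student's attention mass over visual tokens to the teacher's:

```python
import numpy as np

def iva_loss(teacher_attn, student_attn, eps=1e-9):
    """Hypothetical sketch of vision-instruction alignment.

    Both inputs are attention distributions from instruction tokens over
    N visual tokens, shape (num_instruction_tokens, N), rows summing to 1.
    The student is pushed toward the teacher's salient visual regions via
    a row-wise KL divergence, averaged over instruction tokens.
    """
    t = teacher_attn + eps  # smooth to avoid log(0)
    s = student_attn + eps
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=-1)))
```

The loss is zero when the two attention maps coincide and grows as the student's focus drifts from the teacher's salient regions; weighting or masking schemes would be design choices on top of this.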

📝 Abstract
Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions that embed capabilities essential for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components: IVA enables the student model to imitate the teacher's instruction-relevant visual information extraction capability by aligning on salient visual regions, and TPA captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI's superiority. Notably, our approach achieves a $2.6\%$ relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by $7.0\%$, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at https://github.com/lchen1019/Align-TI.
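The abstract says TPA aligns "sequential token-to-token transition probabilities" between teacher and student. A minimal sketch of one plausible reading, assuming the transition probability at step t is the probability a model assigns to the actually generated next token (the function names and the log-space MSE objective are illustrative assumptions, not the paper's formulation):

```python
import numpy as np

def _softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transition_probs(logits, token_ids):
    # p(x_{t+1} | x_{<=t}) for t = 0 .. T-2: the probability each model
    # assigns to the token that actually follows in the response.
    probs = _softmax(logits[:-1])  # (T-1, V)
    return probs[np.arange(len(token_ids) - 1), token_ids[1:]]

def tpa_loss(teacher_logits, student_logits, token_ids, eps=1e-9):
    """Hypothetical sketch: match the student's per-step transition
    probabilities to the teacher's (mean squared error in log space,
    one of several plausible alignment objectives)."""
    p_t = transition_probs(teacher_logits, token_ids)
    p_s = transition_probs(student_logits, token_ids)
    return float(np.mean((np.log(p_t + eps) - np.log(p_s + eps)) ** 2))
```

Unlike vanilla next-token KD, which matches full output distributions position by position, this objective tracks how probability mass flows along the generated sequence, which is the "generative logic" the abstract attributes to TPA.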
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Knowledge Distillation
Token Interactions
Model Compression
Next-Token Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Interactions
Knowledge Distillation
Multimodal Large Language Models
Vision-Instruction Alignment
Token Transition Probability