🤖 AI Summary
Multimodal large language models (MLLMs) incur substantial computational overhead from visual token processing, and existing compression methods overlook the increased training difficulty caused by feature-space perturbations introduced during compression.
Method: We propose a progressive consistency distillation framework that, for the first time, decouples feature-space perturbations into token-level and layer-level components. It introduces a dual-path consistency distillation mechanism—enforcing both token-wise and inter-layer feature alignment between teacher and student models—alongside a progressive training strategy to mitigate optimization challenges induced by aggressive compression.
Contribution/Results: Our method significantly reduces visual token computation while preserving model performance, outperforming state-of-the-art compression approaches across multiple benchmarks. It demonstrates superior robustness and generalization capability, offering an effective and principled solution for efficient MLLM deployment.
📝 Abstract
Visual tokens consume substantial computational resources in multimodal large language models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to adapt quickly to the substantial perturbations that token compression induces in the feature space. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature-space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.
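The core ideas above, token-level consistency against a teacher that sees the uncompressed tokens, plus a progressive compression schedule, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names, the linear annealing schedule, and the norm-based token selection are all assumptions made for the example.

```python
import numpy as np

def progressive_keep_ratio(step, total_steps, start=1.0, end=0.25):
    """Linearly anneal the fraction of visual tokens kept (assumed schedule):
    training starts with no compression and progressively compresses harder."""
    return start + (end - start) * min(step, total_steps) / total_steps

def compress_tokens(tokens, keep_ratio):
    """Keep the highest-L2-norm tokens as a simple saliency proxy (assumption);
    returns the kept tokens and their original indices."""
    n_keep = max(1, int(round(tokens.shape[0] * keep_ratio)))
    order = np.argsort(-np.linalg.norm(tokens, axis=1))
    kept = np.sort(order[:n_keep])  # preserve original token order
    return tokens[kept], kept

def token_consistency_loss(student_feats, teacher_feats, kept_idx):
    """Align each retained student token with the teacher feature at the
    same position (MSE stands in for the paper's distillation objective)."""
    return float(np.mean((student_feats - teacher_feats[kept_idx]) ** 2))

# Toy usage: teacher features for 8 visual tokens; the student sees a
# progressively compressed subset and is pulled toward the teacher.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 4))
ratio = progressive_keep_ratio(step=50, total_steps=100)  # 0.625 mid-training
student_in, idx = compress_tokens(teacher, ratio)
loss = token_consistency_loss(student_in, teacher, idx)
```

Starting from the full token set and only gradually tightening the keep ratio is what keeps the feature-space perturbation small at each step, which is the training-difficulty argument the abstract makes.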