🤖 AI Summary
Multimodal large language models (MLLMs) incur substantial computational overhead from visual token processing, and existing compression methods overlook the increased training difficulty caused by feature-space perturbations introduced during compression.
Method: We propose a progressive consistency distillation framework that, for the first time, decouples feature-space perturbations into token-level and layer-level components. It introduces a dual-path consistency distillation mechanism—enforcing both token-wise and inter-layer feature alignment between teacher and student models—alongside a progressive training strategy to mitigate optimization challenges induced by aggressive compression.
Contribution/Results: Our method significantly reduces visual token computation while preserving model performance, outperforming state-of-the-art compression approaches across multiple benchmarks. It demonstrates superior robustness and generalization capability, offering an effective and principled solution for efficient MLLM deployment.
📝 Abstract
Visual tokens consume substantial computational resources in multimodal large language models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to adapt quickly to the substantial perturbations that token compression induces in the feature space. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature-space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.
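The core ideas above, token-level consistency against a teacher that sees the uncompressed tokens, plus a progressive compression schedule, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names, the linear annealing schedule, and the norm-based token selection are all assumptions made for the example.

```python
import numpy as np

def progressive_keep_ratio(step, total_steps, start=1.0, end=0.25):
    """Linearly anneal the fraction of visual tokens kept (assumed schedule):
    training starts with no compression and progressively compresses harder."""
    return start + (end - start) * min(step, total_steps) / total_steps

def compress_tokens(tokens, keep_ratio):
    """Keep the highest-L2-norm tokens as a simple saliency proxy (assumption);
    returns the kept tokens and their original indices."""
    n_keep = max(1, int(round(tokens.shape[0] * keep_ratio)))
    order = np.argsort(-np.linalg.norm(tokens, axis=1))
    kept = np.sort(order[:n_keep])  # preserve original token order
    return tokens[kept], kept

def token_consistency_loss(student_feats, teacher_feats, kept_idx):
    """Align each retained student token with the teacher feature at the
    same position (MSE stands in for the paper's distillation objective)."""
    return float(np.mean((student_feats - teacher_feats[kept_idx]) ** 2))

# Toy usage: teacher features for 8 visual tokens; the student sees a
# progressively compressed subset and is pulled toward the teacher.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 4))
ratio = progressive_keep_ratio(step=50, total_steps=100)  # 0.625 mid-training
student_in, idx = compress_tokens(teacher, ratio)
loss = token_consistency_loss(student_in, teacher, idx)
```

Starting from the full token set and only gradually tightening the keep ratio is what keeps the feature-space perturbation small at each step, which is the training-difficulty argument the abstract makes.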