p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

📅 2024-12-05
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
To address the high training and inference costs of multimodal large language models (MLLMs), this paper proposes an efficient Mixture-of-Depths (MoD)-based architecture in which each transformer decoder layer selects essential vision tokens to process and skips redundant ones. Key contributions include: (1) tanh-gated weight normalization (TanhNorm), which stabilizes training and inference by normalizing gating weights with a zero-centered tanh; (2) symmetric token reweighting (STRing), which strengthens the representations of selected tokens under limited training data; and (3) progressive ratio decay (PRD), a layer-wise schedule that gradually lowers the token retention ratio to match the higher redundancy of vision tokens in deeper layers. Evaluated on 14 mainstream benchmarks, the resulting p-MoD models match or surpass their baselines while cutting inference TFLOPs by 44.4%, KV cache storage by 46.2%, and training GPU hours by 22.3%.
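The summary describes TanhNorm as normalizing gating weights with a zero-centered tanh so that the MoD branch stays stable at the start of training. A minimal sketch of that idea, assuming a simple scaled form `alpha * tanh(logit)` (the function name, `alpha`, and the exact parameterization are illustrative, not taken from the paper):

```python
import math

def tanhnorm(logits, alpha=0.2):
    """Sketch of tanh-gated weight normalization (TanhNorm).

    Squashes raw router logits into the open interval (-alpha, alpha)
    with a zero-centered tanh, so gating weights are bounded and start
    near zero when logits are near zero, keeping the skipped/processed
    mix close to an identity mapping early in training. The paper's
    exact formulation may differ; alpha here is an assumed scale.
    """
    return [alpha * math.tanh(x) for x in logits]
```

Because tanh is odd and saturating, extreme logits cannot blow up the gate, and a zero logit yields exactly zero gating weight.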

📝 Abstract
Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, the substantial training and inference costs impede their advancement. The majority of computation stems from the overwhelming volume of vision tokens processed by the transformer decoder. In this paper, we propose to build efficient MLLMs by leveraging the Mixture-of-Depths (MoD) mechanism, where each transformer decoder layer selects essential vision tokens to process while skipping redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing). Moreover, we observe that vision tokens exhibit higher redundancy in deeper layers and thus design a progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer, employing a shifted cosine schedule. This crucial design fully unleashes the potential of MoD, significantly boosting the efficiency and performance of our models. To validate the effectiveness of our approach, we conduct extensive experiments with two baseline models across 14 benchmarks. Our model, p-MoD, matches or even surpasses the performance of the baseline models, with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and 77.7% GPU hours during training.
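The abstract says PRD reduces the token retention ratio layer by layer with a shifted cosine schedule. One plausible reading is a cosine curve decaying from a high retention ratio at the first decoder layer to a low one at the last, with an optional phase shift; the endpoints `r_max`, `r_min` and the `shift` parameter below are assumptions for illustration, not the paper's reported values:

```python
import math

def prd_retention_ratios(num_layers, r_max=1.0, r_min=0.1, shift=0.0):
    """Sketch of a shifted-cosine progressive ratio decay (PRD) schedule.

    Returns one retention ratio per decoder layer, decaying from r_max
    at layer 0 toward r_min at the last layer along a cosine curve.
    `shift` (in [0, 1]) moves the curve forward so early layers keep
    more tokens before decay kicks in. All parameters are illustrative.
    """
    ratios = []
    for l in range(num_layers):
        # Normalized layer position, clamped after applying the shift.
        t = min(1.0, l / (num_layers - 1) + shift)
        ratios.append(r_min + 0.5 * (r_max - r_min) * (1.0 + math.cos(math.pi * t)))
    return ratios
```

A monotonically decaying schedule of this shape keeps most vision tokens in shallow layers, where they still carry distinct information, and prunes aggressively in deep layers, matching the redundancy pattern the authors observe.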
Problem

Research questions and friction points this paper is trying to address.

Reducing training and inference costs in MLLMs
Selecting essential vision tokens while skipping redundant ones
Improving efficiency and performance with progressive ratio decay
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Depths mechanism for token selection
TanhNorm and STRing for stable training
Progressive ratio decay for efficiency boost
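At its core, per-layer MoD routing keeps only the top-scoring fraction of vision tokens and lets the rest bypass the layer through the residual path. A minimal routing sketch (function name and score source are hypothetical; real implementations operate on tensors, not Python lists):

```python
def mod_select(scores, retention_ratio):
    """Sketch of per-layer MoD token selection.

    Given router scores for each vision token and this layer's retention
    ratio, keep the indices of the top-k scoring tokens (k = ratio * n,
    at least 1); the remaining tokens skip the layer via the residual
    connection. Indices are returned in original token order.
    """
    k = max(1, int(len(scores) * retention_ratio))
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(keep)
```

For example, with scores `[0.1, 0.9, 0.5, 0.3]` and a retention ratio of 0.5, the layer processes tokens 1 and 2 and skips the other two.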
Jun Zhang — State Key Laboratory for Novel Software Technology, Nanjing University
Desen Meng — Nanjing University
Ji Qi — China Mobile (Suzhou) Software Technology Co., Ltd.
Zhenpeng Huang — State Key Laboratory for Novel Software Technology, Nanjing University
Tao Wu — State Key Laboratory for Novel Software Technology, Nanjing University
Limin Wang — State Key Laboratory for Novel Software Technology, Nanjing University; Shanghai AI Lab