🤖 AI Summary
This work addresses the severe inefficiency caused by redundant visual token computation in multimodal large language models (MLLMs), where existing pruning methods struggle to balance performance and compression due to insufficient shallow-layer understanding and rigid scheduling. To overcome these limitations, we propose a Late Injection mechanism that precisely identifies the onset layer of cross-modal fusion, coupled with a concave pyramid pruning strategy that integrates differentiable top-k operations and Early Exit to dynamically adjust pruning ratios in intermediate and deep layers. By introducing persistent positional encoding and a parallel decoupled architecture, our approach eliminates hidden overhead from dynamic pruning and enables efficient, FlashAttention-compatible token selection. Our method achieves state-of-the-art efficiency in MLLM training and inference, compressing approximately 90% of visual tokens while preserving original performance and accelerating training by 1.72×.
📝 Abstract
The quadratic computational cost of processing vision tokens in Multimodal Large Language Models (MLLMs) hinders their widespread adoption. While progressive vision token pruning offers a promising solution, current methods misinterpret the function of shallow layers and rely on rigid pruning schedules, which fail to unlock the full efficiency potential. To address these issues, we propose HiDrop, a framework that aligns token pruning with the true hierarchical function of MLLM layers. HiDrop features two key innovations: (1) Late Injection, which bypasses passive shallow layers to introduce visual tokens exactly where active fusion begins; and (2) Concave Pyramid Pruning with an Early Exit mechanism, which dynamically adjusts pruning rates across middle and deep layers. This process is optimized via an inter-layer similarity measure and a differentiable top-k operator. To ensure practical efficiency, HiDrop further incorporates persistent positional encoding, FlashAttention-compatible token selection, and parallel decoupling of vision computation to eliminate the hidden overhead associated with dynamic token reduction. Extensive experiments show that HiDrop compresses about 90% of visual tokens while matching the original performance and accelerating training by 1.72×. Our work not only sets a new state of the art for efficient MLLM training and inference but also provides valuable insights into the hierarchical nature of multimodal fusion. The code is released at https://github.com/EIT-NLP/HiDrop.
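To make the "differentiable top-k" idea concrete, here is a minimal NumPy sketch of one common relaxation: replacing the hard top-k mask over token importance scores with a temperature-scaled sigmoid around the k-th largest score, so keep-probabilities stay smooth (and hence trainable) while approaching a hard mask as the temperature shrinks. This is a hypothetical illustration of the general technique; HiDrop's actual operator, scoring function, and names (`soft_topk_mask`, `temperature`) are not taken from the paper.

```python
import numpy as np

def soft_topk_mask(scores, k, temperature=0.1):
    """Relaxed top-k selection over token importance scores.

    Returns soft keep-probabilities in (0, 1): a sigmoid centered at the
    midpoint between the k-th and (k+1)-th largest scores. Lowering the
    temperature sharpens the mask toward a hard top-k selection.
    (Illustrative sketch only; not HiDrop's exact operator.)
    """
    sorted_scores = np.sort(scores)[::-1]
    tau = (sorted_scores[k - 1] + sorted_scores[k]) / 2.0
    return 1.0 / (1.0 + np.exp(-(scores - tau) / temperature))

# Importance scores for 8 visual tokens; keep the top 2 (75% pruning).
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.05, 0.3, 0.15, 0.25])
mask = soft_topk_mask(scores, k=2)
print(np.round(mask, 3))  # near 1 for the two highest-scoring tokens
```

Because the mask is smooth in the scores, gradients can flow through token selection during training; at inference, the same scores can drive a hard, FlashAttention-friendly gather of the surviving tokens.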