🤖 AI Summary
This work addresses the severe inefficiency caused by redundant visual token computation in multimodal large language models (MLLMs), where existing pruning methods struggle to balance performance and compression due to insufficient shallow-layer understanding and rigid scheduling. To overcome these limitations, we propose a Late Injection mechanism that precisely identifies the onset layer of cross-modal fusion, coupled with a concave pyramid pruning strategy that integrates differentiable top-k operations and Early Exit to dynamically adjust pruning ratios in intermediate and deep layers. By introducing persistent positional encoding and a parallel decoupled architecture, our approach eliminates hidden overhead from dynamic pruning and enables efficient, FlashAttention-compatible token selection. Our method achieves state-of-the-art efficiency in MLLM training and inference, compressing approximately 90% of visual tokens while preserving original performance and accelerating training by 1.72×.
📝 Abstract
The quadratic computational cost of processing vision tokens in Multimodal Large Language Models (MLLMs) hinders their widespread adoption. While progressive vision token pruning offers a promising solution, current methods misinterpret the function of shallow layers and rely on rigid pruning schedules, which fail to unlock the full efficiency potential. To address these issues, we propose HiDrop, a framework that aligns token pruning with the true hierarchical function of MLLM layers. HiDrop features two key innovations: (1) Late Injection, which bypasses passive shallow layers to introduce visual tokens exactly where active fusion begins; and (2) Concave Pyramid Pruning with an Early Exit mechanism, which dynamically adjusts pruning rates across middle and deep layers. This process is optimized via an inter-layer similarity measure and a differentiable top-k operator. To ensure practical efficiency, HiDrop further incorporates persistent positional encoding, FlashAttention-compatible token selection, and parallel decoupling of vision computation to eliminate the hidden overhead associated with dynamic token reduction. Extensive experiments show that HiDrop compresses about 90% of visual tokens while matching the original performance and accelerating training by 1.72×. Our work not only sets a new state of the art for efficient MLLM training and inference but also provides valuable insights into the hierarchical nature of multimodal fusion. The code is released at https://github.com/EIT-NLP/HiDrop.
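To make the "differentiable top-k" idea concrete, here is a minimal NumPy sketch of one common relaxation: replacing the hard top-k mask over token importance scores with a temperature-scaled sigmoid around the k-th largest score, so keep-probabilities stay smooth (and hence trainable) while approaching a hard mask as the temperature shrinks. This is a hypothetical illustration of the general technique; HiDrop's actual operator, scoring function, and names (`soft_topk_mask`, `temperature`) are not taken from the paper.

```python
import numpy as np

def soft_topk_mask(scores, k, temperature=0.1):
    """Relaxed top-k selection over token importance scores.

    Returns soft keep-probabilities in (0, 1): a sigmoid centered at the
    midpoint between the k-th and (k+1)-th largest scores. Lowering the
    temperature sharpens the mask toward a hard top-k selection.
    (Illustrative sketch only; not HiDrop's exact operator.)
    """
    sorted_scores = np.sort(scores)[::-1]
    tau = (sorted_scores[k - 1] + sorted_scores[k]) / 2.0
    return 1.0 / (1.0 + np.exp(-(scores - tau) / temperature))

# Importance scores for 8 visual tokens; keep the top 2 (75% pruning).
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.05, 0.3, 0.15, 0.25])
mask = soft_topk_mask(scores, k=2)
print(np.round(mask, 3))  # near 1 for the two highest-scoring tokens
```

Because the mask is smooth in the scores, gradients can flow through token selection during training; at inference, the same scores can drive a hard, FlashAttention-friendly gather of the surviving tokens.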