HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe inefficiency caused by redundant visual token computation in multimodal large language models (MLLMs), where existing pruning methods struggle to balance performance and compression due to insufficient shallow-layer understanding and rigid scheduling. To overcome these limitations, we propose a Late Injection mechanism that precisely identifies the onset layer of cross-modal fusion, coupled with a concave pyramid pruning strategy that integrates differentiable top-k operations and Early Exit to dynamically adjust pruning ratios in intermediate and deep layers. By introducing persistent positional encoding and a parallel decoupled architecture, our approach eliminates hidden overhead from dynamic pruning and enables efficient, FlashAttention-compatible token selection. Our method achieves state-of-the-art efficiency in MLLM training and inference, compressing approximately 90% of visual tokens while preserving original performance and accelerating training by 1.72×.
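The concave pyramid schedule described above can be sketched as a per-layer keep-ratio curve. The curve shape, function name, and parameters below are illustrative assumptions, not the paper's exact formulation: shallow layers are left untouched (matching Late Injection), then the pruning ratio grows along a concave √t curve, so pruning is aggressive early in the middle layers and tapers off toward the deep layers.

```python
import math

def concave_keep_ratios(num_layers: int, start_layer: int, final_keep: float = 0.1):
    """Illustrative concave pruning schedule (hypothetical, not HiDrop's exact formula).

    Layers before `start_layer` keep all visual tokens (the region handled by
    Late Injection). From `start_layer` onward, the pruned fraction follows a
    concave sqrt curve, so the keep ratio drops steeply at first and then
    flattens toward `final_keep` in the deepest layer.
    """
    span = num_layers - start_layer
    ratios = []
    for layer in range(num_layers):
        if layer < start_layer:
            ratios.append(1.0)  # no pruning before cross-modal fusion begins
        else:
            t = (layer - start_layer + 1) / span  # progress through pruned layers, in (0, 1]
            # pruned fraction (1 - final_keep) * sqrt(t) is concave in t
            ratios.append(1.0 - (1.0 - final_keep) * math.sqrt(t))
    return ratios

schedule = concave_keep_ratios(num_layers=32, start_layer=8, final_keep=0.1)
```

With these (assumed) settings, a 32-layer model keeps all tokens through layer 7, then decays the keep ratio down to 10% at the final layer, roughly matching the ~90% compression figure reported above.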

📝 Abstract
The quadratic computational cost of processing vision tokens in Multimodal Large Language Models (MLLMs) hinders their widespread adoption. While progressive vision token pruning offers a promising solution, current methods misinterpret shallow-layer functions and use rigid schedules, which fail to unlock the full efficiency potential. To address these issues, we propose HiDrop, a framework that aligns token pruning with the true hierarchical function of MLLM layers. HiDrop features two key innovations: (1) Late Injection, which bypasses passive shallow layers to introduce visual tokens exactly where active fusion begins; and (2) Concave Pyramid Pruning with an Early Exit mechanism to dynamically adjust pruning rates across middle and deep layers. This process is optimized via an inter-layer similarity measure and a differentiable top-k operator. To ensure practical efficiency, HiDrop further incorporates persistent positional encoding, FlashAttention-compatible token selection, and parallel decoupling of vision computation to eliminate hidden overhead associated with dynamic token reduction. Extensive experiments show that HiDrop compresses about 90% of visual tokens while matching the original performance and accelerating training by 1.72×. Our work not only sets a new state-of-the-art for efficient MLLM training and inference but also provides valuable insights into the hierarchical nature of multimodal fusion. The code is released at https://github.com/EIT-NLP/HiDrop.
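The differentiable top-k operator mentioned in the abstract can be illustrated with a common relaxation; this sketch is an assumption about the general technique, not HiDrop's specific operator. A hard top-k selection is non-differentiable, so one standard trick replaces the 0/1 selection mask with a sigmoid centered between the k-th and (k+1)-th largest importance scores, letting gradients flow to the scores during training while recovering the hard mask as the temperature goes to zero.

```python
import math

def soft_topk_mask(scores, k, temperature=0.1):
    """Hedged sketch of a differentiable top-k relaxation (not the paper's operator).

    Returns a soft selection mask in [0, 1] per token: values near 1 for the
    top-k scores, near 0 otherwise. Requires 0 < k < len(scores).
    """
    sorted_scores = sorted(scores, reverse=True)
    # threshold midway between the k-th and (k+1)-th largest scores
    tau = (sorted_scores[k - 1] + sorted_scores[k]) / 2.0
    return [1.0 / (1.0 + math.exp(-(s - tau) / temperature)) for s in scores]

mask = soft_topk_mask([0.9, 0.1, 0.7, 0.3], k=2)
# the two highest-scoring tokens (0.9 and 0.7) receive mask values near 1,
# the others near 0
```

Lowering `temperature` sharpens the mask toward a hard top-k; at inference, the relaxation can be swapped for an exact top-k gather, which is what makes the selection compatible with kernels like FlashAttention that expect contiguous token sets.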
Problem

Research questions and friction points this paper is trying to address.

vision token reduction
multimodal large language models
computational efficiency
hierarchical fusion
token pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Late Injection
Concave Pyramid Pruning
Early Exit
Vision Token Reduction
Multimodal Large Language Models
👥 Authors

Hao Wu
Institute of Digital Twin, Eastern Institute of Technology, Ningbo; Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative

Yingqi Fan
Institute of Digital Twin, Eastern Institute of Technology, Ningbo

Jinyang Dai
University of Science and Technology of China

Junlong Tong
Institute of Digital Twin, Eastern Institute of Technology, Ningbo; Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative; Shanghai Jiao Tong University

Yunpu Ma
Ludwig Maximilian University of Munich
Foundation Models, Agentic AI, Temporal Knowledge Graph, Quantum AI

Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model, multi-modal learning, reasoning