METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

📅 2025-07-28
📈 Citations: 0 (influential: 0)
🤖 AI Summary
To address excessive computational overhead in multi-encoder vision-language models caused by redundant visual tokens, this paper proposes a multi-stage collaborative pruning framework that progressively prunes visual tokens during encoding, fusion, and decoding. It is the first to introduce multi-stage token pruning into multi-encoder architectures, featuring a ranking-guided collaborative token allocation mechanism and a cross-encoder redundancy elimination strategy. The framework further incorporates task-adaptive dynamic pruning ratio adjustment, low-rank modeling, and dynamic sparsification. Evaluated on 11 mainstream multimodal understanding benchmarks, our method reduces visual tokens by 76% compared to EAGLE while incurring only a marginal average performance drop of 0.3%. This yields substantial inference speedup without compromising accuracy, effectively balancing model efficiency and computational cost.
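The page does not detail the ranking-guided token allocation mechanism itself. As a minimal sketch of what score-based visual-token pruning looks like in general (the function name and the norm-based importance score are assumptions of this illustration, not METEOR's actual mechanism):

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio):
    """Keep the top-scoring fraction of visual tokens.

    tokens: (N, D) array of token features
    scores: (N,) importance scores (here a crude proxy; METEOR's
            actual ranking signal is not specified on this page)
    keep_ratio: fraction of tokens to retain
    """
    k = max(1, int(round(len(tokens) * keep_ratio)))
    keep = np.argsort(scores)[-k:]        # indices of the top-k scores
    keep.sort()                           # preserve original token order
    return tokens[keep], keep

rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))      # e.g., a 24x24 patch grid
scores = np.linalg.norm(tokens, axis=1)  # assumed proxy: feature norm
pruned, kept = prune_tokens(tokens, scores, keep_ratio=0.24)
print(pruned.shape)  # (138, 64): a 76% reduction, matching the reported rate
```

Keeping 24% of 576 tokens mirrors the 76% reduction reported against EAGLE; the scoring function here is purely illustrative.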

📝 Abstract
Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods achieve superior performance from complementary visual representations at the cost of prohibitive computational overhead. To address this, we propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR), that eliminates redundant visual tokens across the encoding, fusion, and decoding stages of multi-encoder MLLMs. For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy. Subsequently, for multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning. Finally, we propose an adaptive token pruning method for the LLM decoding stage that further discards tokens irrelevant to the text prompt, dynamically adjusting pruning ratios to specific task demands. To the best of our knowledge, this is the first successful attempt to build an efficient multi-encoder vision-language model with multi-stage pruning strategies. Extensive experiments on 11 benchmarks demonstrate the effectiveness of the proposed approach. Compared with EAGLE, a typical multi-encoder MLLM, METEOR reduces visual tokens by 76% with only a 0.3% average performance drop. The code is available at https://github.com/YuchenLiu98/METEOR.
Problem

Research questions and friction points this paper is trying to address.

Reduces redundant visual tokens in multi-encoder models
Improves efficiency without significant performance drop
Optimizes token pruning across encoding, fusion, decoding stages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive pruning framework for multi-encoder MLLMs
Rank-guided collaborative token assignment strategy
Adaptive token pruning with dynamic ratios
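The cross-encoder redundancy elimination listed above is, in spirit, about not paying twice for tokens that two encoders represent similarly. The cosine-threshold rule below is an illustrative assumption, not the paper's cooperative-pruning method:

```python
import numpy as np

def fuse_and_dedup(tok_a, tok_b, sim_threshold=0.9):
    """Concatenate tokens from two encoders, dropping any token of
    encoder B whose cosine similarity to some token of encoder A
    exceeds the threshold (i.e., a near-duplicate)."""
    a = tok_a / np.linalg.norm(tok_a, axis=1, keepdims=True)
    b = tok_b / np.linalg.norm(tok_b, axis=1, keepdims=True)
    sim = b @ a.T                        # (Nb, Na) cosine similarities
    keep_b = sim.max(axis=1) < sim_threshold
    return np.concatenate([tok_a, tok_b[keep_b]], axis=0)

a = np.eye(4)                                         # 4 orthonormal tokens
b = np.vstack([np.eye(4)[0], [0.5, 0.5, 0.5, 0.5]])   # one duplicate, one new
fused = fuse_and_dedup(a, b)
print(fused.shape)  # (5, 4): B's duplicated token was removed before fusion
```

The duplicate of encoder A's first token is dropped while the genuinely complementary token survives, which is the behavior multi-encoder fusion pruning aims for.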
👥 Authors
Yuchen Liu, Shanghai Jiao Tong University, China
Yaoming Wang, Meituan Inc., China
Bowen Shi, Shanghai Jiao Tong University, China
Xiaopeng Zhang, Huawei Inc., China
Wenrui Dai, Shanghai Jiao Tong University (predictive modeling, image/video coding, signal processing)
Chenglin Li, Shanghai Jiao Tong University, China
Hongkai Xiong, Distinguished Professor, Shanghai Jiao Tong University (image and video coding, signal processing, multimedia communication, vision and learning)
Qi Tian, Huawei Inc., China