Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration

📅 2024-11-26
📈 Citations: 5
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from quadratic computational and memory overhead due to long visual token sequences, severely hindering deployment efficiency. To address the limitations of existing training-free compression methods, particularly in redundant-token identification and critical-information recovery, the authors propose a decoupled three-stage "filter-correlate-compress" framework, instantiated as FiCoCo. First, redundant tokens are filtered based on visual feature similarity. Second, cross-token semantic correlations are modeled to adaptively redistribute information from discarded tokens into retained ones. Third, weighted attention-guided fusion mitigates semantic dilution during compression. The framework requires no fine-tuning or gradient updates and supports dual-path adaptation: FiCoCo-V for vision encoders and FiCoCo-L for language decoders. Evaluated on LLaVA-1.5-7B and LLaVA-NeXT-7B, FiCoCo achieves up to 5.7× and 14.7× FLOPs reduction while preserving 92.8% and 93.6% of original performance, respectively, substantially outperforming state-of-the-art training-free approaches.
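The three stages described above can be sketched roughly as follows. This is an illustrative reconstruction only, not the authors' implementation: the specific redundancy score (nearest-neighbor cosine similarity), the assignment rule, and the attention-weighted fusion are assumptions made for the sketch.

```python
import numpy as np

def filter_correlate_compress(tokens, attn, keep_ratio=0.5):
    """Illustrative token reduction in three stages (assumed details).
    tokens: (N, D) visual token features; attn: (N,) positive attention scores."""
    n, _ = tokens.shape
    keep = int(n * keep_ratio)

    # Filter: score redundancy by each token's highest cosine similarity
    # to any other token; the most redundant tokens are discarded.
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)
    order = np.argsort(sim.max(axis=1))        # least redundant first
    kept_idx, drop_idx = order[:keep], order[keep:]

    # Correlate: route each discarded token to its most similar kept token.
    assign = np.argmax(sim[np.ix_(drop_idx, kept_idx)], axis=1)

    # Compress: attention-weighted fusion of discarded tokens into kept ones,
    # so discarded information is absorbed rather than lost.
    fused = tokens[kept_idx] * attn[kept_idx, None]
    weight = attn[kept_idx].copy()
    for d, k in zip(drop_idx, assign):
        fused[k] += tokens[d] * attn[d]
        weight[k] += attn[d]
    return fused / weight[:, None], kept_idx
```

With `keep_ratio=0.5`, an input of N tokens is compressed to N/2 fused tokens, each a weighted mixture of one kept token and the discarded tokens assigned to it.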

📝 Abstract
The quadratic complexity of Multimodal Large Language Models (MLLMs) with respect to sequence length poses significant computational and memory challenges, hindering their real-world deployment. While existing training-free token reduction methods aim to address these inefficiencies, how to precisely identify redundant visual tokens and recover essential information from the discarded tokens remains unclear. In this paper, we propose a "filter-correlate-compress" framework that decomposes token reduction into three stages: filtering redundant tokens, correlating discarded information to preserved tokens, and compressing tokens to minimize redundancy. Following the framework, we propose a solution, FiCoCo, which addresses the limitations of single-criterion redundancy assessment, adopts adaptive strategies to retain critical information from discarded tokens, and mitigates semantic dilution during token fusion. Two specialized variants, FiCoCo-V (for vision encoders) and FiCoCo-L (for LLM decoders), further optimize efficiency across MLLM architectures. Extensive experiments demonstrate that FiCoCo achieves up to 5.7x/14.7x FLOPs reduction with 92.8%/93.6% performance retention on LLaVA-1.5-7B/LLaVA-NeXT-7B. Our methods consistently outperform state-of-the-art training-free approaches, showcasing effectiveness and generalizability across model architectures, sizes, and tasks without requiring retraining. Our project page is at https://ficoco-accelerate.github.io/.
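The FLOPs savings come from self-attention scaling quadratically with sequence length while projections and FFN scale linearly, so pruning visual tokens shrinks both terms. The sketch below illustrates this with rough per-layer FLOPs counts; all dimensions and token counts here are illustrative assumptions (loosely LLaVA-1.5-7B-like), not figures from the paper.

```python
def transformer_layer_flops(n_tokens, d_model=4096, d_ffn=11008):
    """Rough per-layer FLOPs estimate (assumed dimensions, for illustration).
    Counts multiply-adds as 2 FLOPs each."""
    proj = 4 * 2 * n_tokens * d_model * d_model    # Q, K, V, output projections
    attn = 2 * 2 * n_tokens * n_tokens * d_model   # QK^T and softmax(QK^T)V
    ffn = 2 * 2 * n_tokens * d_model * d_ffn       # up- and down-projections
    return proj + attn + ffn

# Assumed example: ~576 visual + 64 text tokens, visual tokens reduced 8x.
full = transformer_layer_flops(576 + 64)
reduced = transformer_layer_flops(576 // 8 + 64)
ratio = full / reduced
```

Even this crude per-layer estimate yields a several-fold reduction, showing why aggressive visual-token pruning translates into large end-to-end FLOPs savings; the exact 5.7x/14.7x figures depend on the models and reduction schedules evaluated in the paper.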
Problem

Research questions and friction points this paper is trying to address.

Quadratic computational and memory cost of MLLMs over long visual token sequences.
How to precisely identify redundant visual tokens and recover essential information from discarded ones.
How to reduce tokens effectively without any model retraining.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Filter-correlate-compress framework decomposing token reduction into three stages
FiCoCo solution with adaptive retention of information from discarded tokens
Specialized variants FiCoCo-V (vision encoder) and FiCoCo-L (LLM decoder) for architecture-specific efficiency