FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models

📅 2025-05-26
🤖 AI Summary
Large vision-language models (LVLMs) incur high computational overhead from redundant visual tokens, and existing pruning methods that rank tokens by single-layer attention scores can misidentify redundancy that emerges dynamically across layers. Method: This work analyzes visual token redundancy through information flow. It finds that the CLS token acts as an information relay, which simplifies cross-layer flow analysis, and that redundancy emerges progressively through layer-wise attention concentration. Building on these findings, it proposes FlowCut, an information-flow-aware pruning framework. Contribution/Results: FlowCut outperforms the prior state of the art by 1.6% on LLaVA-1.5-7B with 88.9% visual token reduction and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, and delivers a 3.2× speed-up in the prefilling stage.

📝 Abstract
Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model's inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut
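To make the reported reduction rate concrete, the short sketch below converts a reduction percentage into a retained-token count. The 576-token input size for LLaVA-1.5 is an assumption about the standard configuration and is not stated on this page; `tokens_kept` is a hypothetical helper, not part of the FlowCut code.

```python
def tokens_kept(total_tokens: int, reduction: float) -> int:
    """Number of visual tokens that remain after pruning a given fraction."""
    return round(total_tokens * (1.0 - reduction))


# Assumption (not from this page): LLaVA-1.5 feeds 576 visual tokens to the LLM.
ASSUMED_LLAVA_15_VISUAL_TOKENS = 576

kept = tokens_kept(ASSUMED_LLAVA_15_VISUAL_TOKENS, 0.889)
print(kept)  # 64
```

Under that assumption, an 88.9% reduction leaves roughly one ninth of the original visual tokens.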
Problem

Research questions and friction points this paper is trying to address.

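Redundant visual tokens make LVLM inference computationally expensive
Single-layer attention scores may be insufficient, and even contradictory, as a criterion for identifying redundant tokens
Redundancy emerges progressively and dynamically across layers, which a single-layer criterion does not capture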
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes visual token redundancy through information flow across tokens and layers
Identifies the CLS token as an information relay that simplifies flow analysis
Proposes FlowCut, an information-flow-aware pruning framework
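The contrast between a single-layer criterion and a multi-layer one can be sketched with a toy example. This is not the FlowCut algorithm: the attention values are made up, and `cumulative_keep` with its `decay` factor is a hypothetical stand-in for any criterion that aggregates CLS attention over several layers.

```python
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, sorted by index."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def single_layer_keep(cls_attn: Sequence[Sequence[float]], layer: int, k: int) -> List[int]:
    """Baseline criterion: rank tokens by CLS attention at one layer only."""
    return top_k_indices(cls_attn[layer], k)


def cumulative_keep(cls_attn: Sequence[Sequence[float]], k: int, decay: float = 0.5) -> List[int]:
    """Illustrative multi-layer criterion: exponential average of CLS attention.

    `decay` is a hypothetical smoothing factor, not a parameter from the paper.
    """
    running = list(cls_attn[0])
    for layer_scores in cls_attn[1:]:
        running = [decay * r + (1.0 - decay) * s for r, s in zip(running, layer_scores)]
    return top_k_indices(running, k)


# Toy CLS-to-patch attention: 3 layers x 4 visual tokens (made-up numbers).
cls_attn = [
    [0.40, 0.30, 0.20, 0.10],
    [0.10, 0.40, 0.30, 0.20],
    [0.05, 0.15, 0.30, 0.50],
]

keep_l0 = single_layer_keep(cls_attn, layer=0, k=2)   # [0, 1]
keep_l2 = single_layer_keep(cls_attn, layer=2, k=2)   # [2, 3]
keep_cum = cumulative_keep(cls_attn, k=2)             # [2, 3]
```

In this toy input, layers 0 and 2 select disjoint token sets, which mirrors the contradictory identification the abstract describes when only one layer is consulted.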
Jintao Tong
Huazhong University of Science and Technology
Large multimodal model, Few-shot learning

Wenwei Jin
Xiaohongshu Inc.

Pengda Qin
Alibaba Group
Multimedia Networking, Self-Supervised Learning, Natural Language Processing

Anqi Li
Institute of Information Science, Beijing Jiaotong University

Yixiong Zou
Huazhong University of Science and Technology
Computer vision, Domain generalization, Few-shot learning, Vision-language model

Yuhong Li
Xiaohongshu Inc.

Yuhua Li
School of Computer Science and Technology, Huazhong University of Science and Technology

Ruixuan Li
School of Computer Science and Technology, Huazhong University of Science and Technology