🤖 AI Summary
This work addresses the high computational cost of large vision-language models caused by the excessive number of visual tokens, and the limited understanding of attention mechanisms in existing pruning methods. The authors reformulate attention as an implicit linear layer constructed from outer products of token key-value pairs, revealing for the first time, through this dual perspective, that token pruning is equivalent to a low-rank approximation of the implicit weight matrix. Building on this insight, they introduce a novel metric that jointly evaluates token informativeness and redundancy, and propose a Progressive Chunked Maximal Marginal Relevance (PC-MMR) algorithm to efficiently select an optimal subset of tokens. Experiments demonstrate that the method significantly outperforms current pruning strategies across diverse vision-language tasks, achieving substantial gains in inference efficiency while preserving model performance, thereby validating the effectiveness and generality of the proposed framework.
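The dual-form equivalence described above can be sketched numerically for un-normalized linear attention (a deliberate simplification; the paper extends the idea to standard softmax attention in LVLMs). The implicit weight matrix is the sum of one rank-1 outer product per token, so dropping tokens amounts to dropping rank-1 terms:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4                         # number of visual tokens, head dimension
K = rng.standard_normal((n, d))     # per-token keys
V = rng.standard_normal((n, d))     # per-token values
q = rng.standard_normal(d)          # a single query

# Dual form: the implicit weight matrix W = sum_i k_i v_i^T,
# a sum of rank-1 outer products, one per token.
W = sum(np.outer(K[i], V[i]) for i in range(n))

# Equivalence: attending over all tokens == multiplying the query by W.
out_attn = sum((q @ K[i]) * V[i] for i in range(n))
out_dual = q @ W
assert np.allclose(out_attn, out_dual)

# Token pruning == keeping a subset S of the rank-1 updates; a good
# subset makes sum_{i in S} k_i v_i^T approximate W closely.
S = [0, 2, 5]                       # an arbitrary kept subset, for illustration
W_pruned = sum(np.outer(K[i], V[i]) for i in S)
err = np.linalg.norm(W - W_pruned)  # Frobenius approximation error
```

The Frobenius error of the pruned matrix is exactly the quantity a subset-selection criterion would aim to minimize under this view.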
📝 Abstract
Large Vision-Language Models (LVLMs) show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical approaches while overlooking the internal mechanism of attention. In this paper, we propose a novel training-free token pruning framework grounded in the dual-form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is the sum of rank-1 outer products, each generated by a single token's key-value pair. Token pruning thus reduces to selecting an optimal subset of these rank-1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric quantifying both a token's information magnitude and information duplication. To efficiently select the subset with the proposed metric, we introduce Progressive Chunked Maximal Marginal Relevance (PC-MMR). Extensive experiments demonstrate that our method achieves a better trade-off between performance and efficiency, while providing another perspective on existing pruning approaches.
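The selection step can be illustrated with a rough greedy sketch. The paper's actual PC-MMR metric and schedule are not reproduced here; as stand-ins, this toy version scores informativeness by key norm and redundancy by cosine similarity to already-selected tokens, processing tokens chunk by chunk under a cumulative budget:

```python
import numpy as np

def pcmmr_select(K, budget, lam=0.7, chunk=4):
    """Greedy Maximal Marginal Relevance over token keys, processed
    progressively in chunks. A hypothetical sketch: key norm stands in
    for the paper's information-magnitude term, and max cosine
    similarity to selected tokens for its duplication term."""
    n = K.shape[0]
    norms = np.linalg.norm(K, axis=1)
    Kn = K / (norms[:, None] + 1e-8)      # unit keys for cosine similarity
    selected, cand = [], []
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        cand.extend(range(start, end))    # candidates seen so far
        target = budget * end // n        # cumulative selection quota
        while len(selected) < target:
            # MMR score: lam * informativeness - (1 - lam) * redundancy
            best = max(
                cand,
                key=lambda i: lam * norms[i]
                - (1 - lam) * max((Kn[i] @ Kn[j] for j in selected), default=0.0),
            )
            selected.append(best)
            cand.remove(best)
    return sorted(selected)

K = np.random.default_rng(1).standard_normal((16, 8))
kept = pcmmr_select(K, budget=6)
print(kept)  # indices of the retained visual tokens
```

Scanning candidates chunk by chunk keeps the pairwise-similarity work bounded per step instead of comparing every token against every other token at once, which is the efficiency motivation for a progressive, chunked variant of MMR.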