🤖 AI Summary
Large multimodal models incur high inference overhead because of excessive computation over visual tokens. This work systematically identifies and quantifies the computational redundancy of visual tokens within the decoder. To exploit it, we propose ProxyV, a lightweight proxy-token mechanism operating at the *computation level* rather than the token level: a small set of proxy tokens absorbs the heavy decoder operations, replacing redundant forward passes over the original visual tokens while preserving their information. The approach combines analysis of the pretrained vision encoder, module-level ablation studies, and a principled proxy-token architecture design, and it composes with complementary compression techniques such as token pruning. Experiments show that ProxyV reduces visual-side FLOPs by 42% on average, accelerates inference by 40%, and matches or even exceeds baseline performance across multimodal understanding tasks.
📝 Abstract
Large multimodal models excel at multimodal tasks but face significant computational challenges due to excessive computation on vision tokens. Unlike token reduction methods that target token-level redundancy, we identify and study computation-level redundancy on vision tokens, which ensures no information loss. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We design a series of experiments to discover and progressively squeeze out this vision-related computational redundancy. Based on our findings, we propose ProxyV, a novel approach that uses proxy vision tokens to alleviate the computational burden on the original vision tokens. ProxyV improves efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated by combining it with token reduction methods to further boost efficiency. The code is available at https://github.com/penghao-wu/ProxyV.
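To make the proxy-token idea concrete, below is a minimal PyTorch sketch of one decoder layer, assuming a design where the heavy self-attention and FFN run only over the text tokens plus a few proxy vision tokens, while the full set of vision tokens receives a single cheap cross-attention update from the proxies. The class name `ProxyVLayerSketch`, the parameter `n_proxy`, and the single-head `light_update` module are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ProxyVLayerSketch(nn.Module):
    """Hypothetical proxy-token decoder layer (a sketch, not ProxyV itself).

    Heavy operations (self-attention + FFN) run over text tokens and a small
    set of proxy vision tokens only; the full vision tokens are kept intact
    (no information loss) and updated by a lightweight module guided by the
    proxies. Causal masking is omitted for brevity.
    """

    def __init__(self, dim: int, n_heads: int, n_proxy: int):
        super().__init__()
        self.n_proxy = n_proxy
        # Heavy path: standard decoder components.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Light path (assumed design): one cheap single-head cross-attention
        # that lets every vision token read from the few proxy tokens.
        self.light_update = nn.MultiheadAttention(dim, 1, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, vision, proxy, text):
        # Heavy path over [proxy; text] only: cost scales with (P + T)^2
        # instead of (V + T)^2, where V >> P for high-resolution inputs.
        pt = torch.cat([proxy, text], dim=1)
        h = self.norm1(pt)
        pt = pt + self.attn(h, h, h, need_weights=False)[0]
        pt = pt + self.ffn(self.norm2(pt))
        proxy, text = pt[:, : self.n_proxy], pt[:, self.n_proxy :]
        # Light path: all V vision tokens attend only to the P proxies,
        # so the per-layer cost on vision tokens is O(V * P), not O(V^2).
        vision = vision + self.light_update(
            vision, proxy, proxy, need_weights=False
        )[0]
        return vision, proxy, text

# Usage: 576 vision tokens are all preserved, but only 8 proxies
# participate in the heavy decoder computation.
layer = ProxyVLayerSketch(dim=64, n_heads=4, n_proxy=8)
vision = torch.randn(2, 576, 64)  # full vision tokens (kept, never pruned)
proxy = torch.randn(2, 8, 64)     # e.g., pooled from the vision tokens
text = torch.randn(2, 32, 64)
vision, proxy, text = layer(vision, proxy, text)
```

Under these assumptions the quadratic attention and FFN cost over vision tokens collapses to the proxy count, which is the computation-level (rather than token-level) saving the abstract describes.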