🤖 AI Summary
Self-attention suffers from quadratic computational complexity and high memory overhead, especially in long-sequence tasks such as video processing. While FlashAttention reduces memory access costs, it never materializes the attention map, making it incompatible with training-free token compression methods that rely on attention maps to score tokens. This paper proposes a training-free, model-agnostic token compression framework. Its core innovation is the Representation Shift metric, an importance estimator that scores each token by how much its representation changes, without explicitly computing or storing the attention map. Integrated seamlessly with FlashAttention kernels, the method enables attention-map-free, training-free token compression, and it generalizes beyond Transformers to CNNs and state-space models. On video-text retrieval and video question answering benchmarks, it improves inference speed by up to 5.5% and 4.4%, respectively, improving deployment efficiency for large multimodal models.
📝 Abstract
Transformers have demonstrated remarkable success across vision, language, and video. Yet, increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory access. To reduce the computation cost of self-attention, prior work has proposed token compression techniques that drop redundant or less informative tokens. Meanwhile, fused attention kernels such as FlashAttention have been developed to alleviate memory overhead by avoiding attention map construction and its associated I/O to HBM. This, however, makes FlashAttention incompatible with most training-free token compression methods, which rely on attention maps to determine token importance. Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token's representation. This seamlessly integrates token compression with FlashAttention, without attention maps or retraining. Our method further generalizes beyond Transformers to CNNs and state space models. Extensive experiments show that Representation Shift enables effective token compression compatible with FlashAttention, yielding speedups of up to 5.5% and 4.4% in video-text retrieval and video QA, respectively. Code is available at https://github.com/mlvlab/Representation-Shift.
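To make the idea concrete, here is a minimal sketch of attention-map-free token compression based on a representation-shift score. It assumes the shift is measured as the L2 norm of the difference between a token's representation before and after a layer, and that the lowest-scoring tokens are dropped; the paper's exact formulation and pruning schedule may differ (`representation_shift`, `compress_tokens`, and `keep_ratio` are hypothetical names for illustration).

```python
import numpy as np

def representation_shift(x_in, x_out):
    """Per-token importance: magnitude of change across a layer.

    x_in, x_out: (num_tokens, dim) representations before/after the layer.
    Hypothetical formulation; the paper's exact metric may differ.
    """
    return np.linalg.norm(x_out - x_in, axis=-1)

def compress_tokens(x_in, x_out, keep_ratio=0.5):
    """Keep the tokens whose representations changed the most."""
    scores = representation_shift(x_in, x_out)
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[-k:]   # indices of the top-k shifting tokens
    return x_out[np.sort(keep)]     # preserve original token order

# Toy example: 8 tokens, 4-dim features; tokens 2 and 5 shift strongly.
rng = np.random.default_rng(0)
x_in = rng.standard_normal((8, 4))
x_out = x_in.copy()
x_out[2] += 3.0
x_out[5] += 2.0

kept = compress_tokens(x_in, x_out, keep_ratio=0.25)  # keeps 2 of 8 tokens
```

Because the score uses only the layer's input and output activations, no attention map is ever needed, which is what lets the compression run alongside fused kernels like FlashAttention.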