MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems in Vision Transformers: the quadratic computational cost of self-attention when token counts are large, and the poor GPU efficiency of existing token compression methods such as ToMe, which rely on sorting and scattered writes. The authors propose MaMe (Matrix-based token Merging) and its inverse MaRe (token Restoration), a differentiable, GPU-friendly approach that relies exclusively on matrix operations and can be applied to pre-trained models without additional training. The method is demonstrated on ViT, SigLIP2, VideoMAE, and Stable Diffusion. Applied training-free, MaMe doubles ViT-B throughput with a 2% accuracy drop; fine-tuning only the last layer with MaMe yields a 1.0% accuracy improvement at 1.1× speed. Further results include a 1.3× acceleration in SigLIP2-B@512 zero-shot classification with negligible degradation, a 48.5% speedup for VideoMAE-L on Kinetics-400 with a 0.84% accuracy loss, and a 31% reduction in generation latency for Stable Diffusion v2.1 alongside improved image quality.
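To make the quadratic-cost argument concrete, here is a rough back-of-the-envelope sketch. The function name `attention_macs` and the numbers (196 patch tokens, width 768, roughly a ViT-B/16 at 224×224) are illustrative assumptions, not figures from the paper; only the two matrix products inside attention are counted.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulates for one self-attention layer.

    Counts only Q @ K^T (N x d times d x N) and A @ V (N x N times N x d);
    projections and the MLP, which scale linearly in N, are ignored.
    """
    return 2 * num_tokens * num_tokens * dim


# Assumed sizes for illustration only.
full = attention_macs(num_tokens=196, dim=768)
halved = attention_macs(num_tokens=98, dim=768)

# Halving the token count cuts this term by a factor of four.
ratio = full / halved
print(ratio)
```

Because the rest of the network scales only linearly with token count, end-to-end speedups from token compression are smaller than this factor, which is consistent with the roughly 2× throughput figure reported above.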

📝 Abstract

Token compression is crucial for mitigating the quadratic complexity of self-attention in Vision Transformers (ViTs), which often process large numbers of input tokens. Existing methods, such as ToMe, rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that limit their effectiveness. We introduce MaMe, a training-free, differentiable token merging method based entirely on GPU-friendly matrix operations, to accelerate ViTs. Additionally, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with only a 0.84% accuracy loss. Furthermore, MaMe achieves simultaneous improvements in both performance and speed on some tasks. In image synthesis, the MaMe+MaRe pipeline enhances quality while reducing Stable Diffusion v2.1 generation latency by 31%. Collectively, these results demonstrate the effectiveness of MaMe and MaRe in accelerating vision models. The code is available at https://github.com/cominder/mame.
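The abstract does not spell out how the merge and restore matrices are built, so the following is only a minimal NumPy sketch of the general idea of merging and restoring tokens with dense matrix products instead of sorting or scattered writes. Everything in it is an assumption made for illustration: the function names `merge_matrix`, `merge`, and `restore`, the evenly spaced anchor tokens, the softmax assignment, and the `temperature` parameter. It should not be read as the MaMe or MaRe algorithm itself; the repository linked above is the authoritative source.

```python
import numpy as np


def merge_matrix(x: np.ndarray, k: int, temperature: float = 0.1) -> np.ndarray:
    """Soft assignment (N x k) of N tokens to k anchor tokens.

    Hypothetical construction: anchors are evenly spaced tokens and the
    assignment is a softmax over cosine similarities.
    """
    n = x.shape[0]
    anchors = x[np.linspace(0, n - 1, k).astype(int)]
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    an = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    logits = xn @ an.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    assign = np.exp(logits)
    return assign / assign.sum(axis=1, keepdims=True)  # each row sums to 1


def merge(x: np.ndarray, assign: np.ndarray) -> np.ndarray:
    """Merge N tokens into k as a weighted average, using one matmul."""
    weights = assign / assign.sum(axis=0, keepdims=True)  # each column sums to 1
    return weights.T @ x  # (k x d)


def restore(merged: np.ndarray, assign: np.ndarray) -> np.ndarray:
    """Approximate inverse: spread k merged tokens back to N positions."""
    return assign @ merged  # (N x d)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))       # 16 tokens of width 8
assign = merge_matrix(tokens, k=4)
merged = merge(tokens, assign)          # 4 tokens go through the expensive layers
restored = restore(merged, assign)      # back to 16 tokens for dense output
print(merged.shape, restored.shape)
```

Every step here is a dense matrix product or an elementwise operation, so the whole path is differentiable with respect to the token features and maps directly onto GPU matmul kernels, which is the property the abstract emphasizes.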

Problem

Research questions and friction points this paper is trying to address.

Vision Transformers

token compression

self-attention complexity

efficient inference

visual perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

token merging

matrix operations

Vision Transformers