Efficient Visual Transformer by Learnable Token Merging

📅 2024-07-21
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address computational redundancy in Vision Transformers, this paper proposes Learnable Token Merging (LTM), the first approach to incorporate information bottleneck theory into token compression design. We derive a separable variational upper bound and construct a lightweight, theory-driven mask generation module. LTM is plug-and-play compatible with mainstream architectures—including MobileViT and EfficientViT—without requiring retraining. On multi-task benchmarks (e.g., image classification and object detection), LTM-enhanced Transformers reduce parameters and FLOPs by 15–30% while improving Top-1 accuracy by 0.3–1.2% on average and decreasing inference latency by 43–52% (i.e., accelerating inference by 1.4–2.1×). Our core contributions are: (i) an information bottleneck–guided, learnable token compression paradigm; and (ii) an efficient sparse attention masking mechanism integrated within a broadly compatible architectural design.
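To see why merging tokens cuts cost, recall that self-attention FLOPs grow quadratically in the token count. The back-of-envelope below uses a generic Transformer cost model (QKV/output projections plus attention matmuls), not the paper's exact accounting; the token count, width, and 2:1 merge ratio are illustrative assumptions:

```python
def attention_flops(n_tokens, dim):
    """Rough FLOP count for one self-attention block:
    QKV + output projections (4*n*d^2) plus the two
    attention matrix multiplies (2*n^2*d)."""
    return 4 * n_tokens * dim**2 + 2 * n_tokens**2 * dim

n, d = 196, 384                       # e.g. 14x14 patch tokens at ViT-S width (illustrative)
base = attention_flops(n, d)
merged = attention_flops(n // 2, d)   # after merging tokens 2:1
print(f"FLOP reduction: {1 - merged / base:.1%}")
```

Because the quadratic term shrinks by 4x while the linear-in-n projections shrink by 2x, a 2:1 merge removes more than half of this block's FLOPs under these assumptions.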

📝 Abstract
Self-attention and transformers have been widely used in deep learning. Recent efforts have been devoted to incorporating transformer blocks into different neural architectures, including those with convolutions, leading to various visual transformers for computer vision tasks. In this paper, we propose a novel and compact transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer. LTM-Transformer performs token merging in a learnable scheme. LTM-Transformer is compatible with many popular and compact transformer networks, and it reduces the FLOPs and the inference time of the visual transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in popular visual transformers, including MobileViT, EfficientViT, ViT-S/16, and Swin-T, with LTM-Transformer blocks, leading to LTM-Transformer networks with different backbones. The LTM-Transformer is motivated by reducing the Information Bottleneck (IB) loss, and a novel, separable variational upper bound for the IB loss is derived. The mask module in our LTM blocks, which generates the token-merging mask, is designed to reduce this derived upper bound on the IB loss. Extensive results on computer vision tasks show that LTM-Transformer renders compact and efficient visual transformers with comparable or much better prediction accuracy than the original visual transformers. The code of the LTM-Transformer is available at https://github.com/Statistical-Deep-Learning/LTM.
Problem

Research questions and friction points this paper is trying to address.

High FLOPs and inference time of visual transformers
Maintaining or improving prediction accuracy while merging tokens
Compatibility with popular transformer networks such as MobileViT and EfficientViT
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable token merging guided by an Information Bottleneck objective
Separable variational upper bound for the IB loss, used to design the mask module
Drop-in replacement for transformer blocks in MobileViT, EfficientViT, ViT-S/16, and Swin-T