🤖 AI Summary
To address computational redundancy in Vision Transformers, this paper proposes Learnable Token Merging (LTM), the first approach to incorporate information bottleneck (IB) theory into token compression design. We derive a separable variational upper bound on the IB loss and construct a lightweight, theory-driven mask generation module to reduce it. LTM is plug-and-play compatible with mainstream architectures, including MobileViT and EfficientViT, without requiring retraining. On multi-task benchmarks (e.g., image classification and object detection), LTM-enhanced transformers reduce parameters and FLOPs by 15–30%, improve Top-1 accuracy by 0.3–1.2%, and decrease inference latency by 43–52% (i.e., a 1.4–2.1× speedup). Our core contributions are: (i) an information bottleneck-guided, learnable token compression paradigm; and (ii) an efficient sparse attention masking mechanism within a broadly compatible architectural design.
📝 Abstract
Self-attention and transformers have been widely used in deep learning. Recent efforts have been devoted to incorporating transformer blocks into different neural architectures, including those with convolutions, leading to various visual transformers for computer vision tasks. In this paper, we propose a novel and compact transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer. LTM-Transformer performs token merging in a learnable scheme. It is compatible with many popular and compact transformer networks, and it reduces the FLOPs and the inference time of visual transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in popular visual transformers, including MobileViT, EfficientViT, ViT-S/16, and Swin-T, with LTM-Transformer blocks, leading to LTM-Transformer networks with different backbones. The LTM-Transformer is motivated by reduction of the Information Bottleneck (IB) loss, and a novel and separable variational upper bound for the IB loss is derived. The mask module in our LTM blocks, which generates the token merging mask, is designed to reduce this derived upper bound. Extensive results on computer vision tasks show that LTM-Transformer renders compact and efficient visual transformers with comparable or much better prediction accuracy than the original visual transformers. The code of the LTM-Transformer is available at https://github.com/Statistical-Deep-Learning/LTM.
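To make the core idea concrete, the following is a minimal sketch of learnable token merging: a small mask module produces a soft assignment from the N input tokens to K < N merged tokens, and the merged tokens are weighted averages of the inputs. This is an illustrative simplification, not the paper's exact architecture; the class and parameter names (`LearnableTokenMerging`, `mask_proj`, `num_merged`) are hypothetical.

```python
import torch
import torch.nn as nn


class LearnableTokenMerging(nn.Module):
    """Illustrative sketch (not the paper's exact design): a lightweight
    mask module maps N input tokens to K < N merged tokens via a learned
    soft assignment matrix."""

    def __init__(self, dim: int, num_merged: int):
        super().__init__()
        # Small mask-generation module: projects each token to K logits.
        self.mask_proj = nn.Linear(dim, num_merged)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, d) -> mask: (B, N, K); softmax over the token
        # axis so each merged token is a convex combination of inputs.
        mask = self.mask_proj(tokens).softmax(dim=1)
        # Merged tokens: (B, K, N) @ (B, N, d) -> (B, K, d).
        return mask.transpose(1, 2) @ tokens


x = torch.randn(2, 196, 64)                 # e.g., 14x14 patch tokens
merged = LearnableTokenMerging(64, 49)(x)   # merge 196 tokens into 49
print(merged.shape)                         # torch.Size([2, 49, 64])
```

Because downstream attention then operates on K rather than N tokens, the quadratic attention cost shrinks by roughly (N/K)^2, which is the source of the FLOPs and latency savings described above.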