RMT: Retentive Networks Meet Vision Transformers

📅 2023-09-20
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 47
Influential: 4

🤖 AI Summary
Vision Transformers (ViTs) lack an explicit spatial inductive bias, and their self-attention has quadratic computational complexity $O(N^2)$ in the number of tokens $N$, which limits both efficiency and generalization. To address this, the authors propose RMT, a general-purpose vision backbone that adapts the temporal decay mechanism of RetNet (originally designed for NLP) to the spatial domain. Specifically, they introduce a Manhattan-distance-based spatial decay matrix to encode geometric priors, and an attention decomposition that models global information at linear complexity without disrupting that decay matrix. Without extra training data, RMT achieves 84.8% and 86.1% top-1 accuracy on ImageNet-1K with 27M parameters / 4.5 GFLOPs and 96M parameters / 18.2 GFLOPs, respectively; 54.5 box AP and 47.2 mask AP on COCO object detection; and 52.8 mIoU on ADE20K semantic segmentation.
📝 Abstract
Vision Transformer (ViT) has gained increasing attention in the computer vision community in recent years. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and bears a quadratic computational complexity, thereby constraining the applicability of ViT. To alleviate these issues, we draw inspiration from the recent Retentive Network (RetNet) in the field of NLP, and propose RMT, a strong vision backbone with explicit spatial prior for general purposes. Specifically, we extend the RetNet's temporal decay mechanism to the spatial domain, and propose a spatial decay matrix based on the Manhattan distance to introduce the explicit spatial prior to Self-Attention. Additionally, an attention decomposition form that adeptly adapts to explicit spatial prior is proposed, aiming to reduce the computational burden of modeling global information without disrupting the spatial decay matrix. Based on the spatial decay matrix and the attention decomposition form, we can flexibly integrate explicit spatial prior into the vision backbone with linear complexity. Extensive experiments demonstrate that RMT exhibits exceptional performance across various vision tasks. Specifically, without extra training data, RMT achieves 84.8% and 86.1% top-1 acc on ImageNet-1k with 27M/4.5GFLOPs and 96M/18.2GFLOPs. For downstream tasks, RMT achieves 54.5 box AP and 47.2 mask AP on the COCO detection task, and 52.8 mIoU on the ADE20K semantic segmentation task.
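To make the spatial decay matrix concrete, here is a minimal sketch in plain Python. It assumes that each entry is a decay rate raised to the Manhattan distance between two tokens on the image grid, by analogy with RetNet's temporal decay. The function names, the row-major token ordering, and the value of `gamma` are illustrative assumptions, not details taken from the abstract.

```python
def manhattan_decay_matrix(height, width, gamma=0.5):
    """Build D with D[n][m] = gamma ** (|r_n - r_m| + |c_n - c_m|).

    Assumption: tokens are flattened row by row, so token n sits at
    (row, col) = (n // width, n % width). gamma is a hypothetical decay
    rate in (0, 1); the abstract does not state how it is chosen.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [gamma ** (abs(r1 - r2) + abs(c1 - c2)) for (r2, c2) in coords]
        for (r1, c1) in coords
    ]


def apply_decay(attn, decay):
    """Element-wise product of attention weights with the decay matrix."""
    return [
        [a * d for a, d in zip(attn_row, decay_row)]
        for attn_row, decay_row in zip(attn, decay)
    ]


# A 2 x 3 token grid gives a 6 x 6 decay matrix.
D = manhattan_decay_matrix(2, 3, gamma=0.5)

# Uniform attention weights, down-weighted by distance.
uniform = [[1.0 / 6] * 6 for _ in range(6)]
decayed = apply_decay(uniform, D)
```

Nearby tokens keep most of their weight and distant tokens are suppressed, which is one way to read "explicit spatial prior". Note that this dense matrix is itself $N \times N$, so the sketch illustrates only the prior; the linear complexity claimed in the abstract comes from the attention decomposition, which is not reproduced here.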
Problem

Research questions and friction points this paper is trying to address.

Self-Attention in ViT lacks explicit spatial priors
Self-Attention has quadratic computational complexity
Both issues constrain the applicability of ViT across vision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends RetNet's temporal decay mechanism to the spatial domain
Introduces a spatial decay matrix based on the Manhattan distance
Decomposes attention to model global information with linear complexity
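
One property that helps explain why a Manhattan-distance decay can survive an axis-wise treatment of attention: because the distance is a sum of a row term and a column term, the 2D decay factors into a product of two 1D decays. The sketch below checks this numerically in plain Python. It is an illustration of that identity only, under the assumption that the decay has the form `gamma ** distance`; it is not the paper's decomposition, which this page does not spell out. All names and values are hypothetical.

```python
def decay_1d(n, gamma):
    """1D decay along one axis: entry (i, j) is gamma ** |i - j|."""
    return [[gamma ** abs(i - j) for j in range(n)] for i in range(n)]


def decay_2d(height, width, gamma):
    """Full decay over a row-major flattened grid, using Manhattan distance."""
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [gamma ** (abs(r1 - r2) + abs(c1 - c2)) for (r2, c2) in coords]
        for (r1, c1) in coords
    ]


def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


H, W, GAMMA = 3, 4, 0.5  # hypothetical grid size and decay rate

full = decay_2d(H, W, GAMMA)
factored = kron(decay_1d(H, GAMMA), decay_1d(W, GAMMA))

# Largest element-wise difference between the two constructions.
max_gap = max(
    abs(x - y)
    for full_row, fact_row in zip(full, factored)
    for x, y in zip(full_row, fact_row)
)
print(max_gap)
```

The two constructions agree because `gamma ** (dr + dc)` equals `gamma ** dr * gamma ** dc`, so a per-row decay and a per-column decay together carry the same prior as the full matrix.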