Mamba-Adaptor: State Space Model Adaptor for Visual Recognition

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Mamba faces three key limitations in vision tasks: (1) a lack of global context due to causal modeling, (2) forgetting of long-range dependencies, and (3) weak spatial structure modeling. To address these, we propose Mamba-Adaptor, the first vision-oriented State Space Model (SSM) adapter architecture. It introduces Adaptor-T, which adds learnable temporal memory to mitigate long-range forgetting, and Adaptor-S, which incorporates multi-scale dilated convolutions to explicitly encode spatial image priors. The design supports plug-and-play backbone upgrading, performance boosting of pretrained models, and efficient fine-tuning. Evaluated on ImageNet classification and COCO detection, Mamba-Adaptor achieves state-of-the-art results. Pretrained backbones gain substantial performance, and downstream adaptation requires tuning only 0.5% of parameters, delivering both high accuracy and inference efficiency.

📝 Abstract
Recent State Space Models (SSMs), especially Mamba, have demonstrated impressive performance in visual modeling and possess superior model efficiency. However, the application of Mamba to visual tasks suffers from inferior performance due to three main constraints of the sequential model: 1) Causal computing is incapable of accessing global context; 2) Long-range forgetting when computing the current hidden states; 3) Weak spatial structural modeling due to the transformed sequential input. To address these issues, we investigate a simple yet powerful vision task Adaptor for Mamba models, which consists of two functional modules: Adaptor-T and Adaptor-S. When solving the hidden states for the SSM, we apply a lightweight prediction module, Adaptor-T, to select a set of learnable locations as memory augmentations to ease the long-range forgetting issue. Moreover, we leverage Adaptor-S, composed of multi-scale dilated convolutional kernels, to enhance spatial modeling and introduce the image inductive bias into the feature output. Both modules enlarge the context available to causal computing, as the output is enhanced by otherwise inaccessible features. We explore three usages of Mamba-Adaptor: a general visual backbone for various vision tasks; a booster module to raise the performance of pretrained backbones; and a highly efficient fine-tuning module that adapts the base model for transfer learning tasks. Extensive experiments verify the effectiveness of Mamba-Adaptor in all three settings. Notably, our Mamba-Adaptor achieves state-of-the-art performance on the ImageNet and COCO benchmarks.
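The abstract's Adaptor-T idea, re-injecting hidden states from a set of learnable past locations to counter long-range forgetting, can be sketched in miniature. The scalar recurrence, the fixed memory positions, and the weights below are illustrative assumptions for a single channel, not the paper's actual implementation:

```python
import numpy as np

def ssm_scan(x, a, b):
    """Scalar SSM recurrence h_t = a*h_{t-1} + b*x_t; returns all hidden states.
    With |a| < 1, early inputs decay geometrically (long-range forgetting)."""
    h = np.zeros(len(x), dtype=float)
    prev = 0.0
    for t in range(len(x)):
        prev = a * prev + b * x[t]
        h[t] = prev
    return h

def adaptor_t(h, mem_positions, mem_weights):
    """Hypothetical Adaptor-T-style augmentation: at each step t, add a
    weighted sum of hidden states from selected past positions, so distant
    context re-enters the output without changing the causal scan."""
    out = h.astype(float).copy()
    for t in range(len(h)):
        for p, w in zip(mem_positions, mem_weights):
            if p < t:  # causal: only memory locations strictly in the past
                out[t] += w * h[p]
    return out
```

In the paper the memory locations are predicted by a lightweight module; here they are hard-coded to keep the sketch minimal.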
Problem

Research questions and friction points this paper is trying to address.

Mamba models lack global context access in visual tasks
Long-range forgetting weakens hidden state computation
Sequential input transformation impairs spatial structural modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptor-T enhances memory with learnable locations
Adaptor-S improves spatial modeling via dilated convolutions
Mamba-Adaptor boosts performance in diverse vision tasks
Fei Xie
Shanghai Jiao Tong University
Jiahao Nie
Hangzhou Dianzi University
Yujin Tang
Shanghai Jiao Tong University
Wenkang Zhang
Shanghai Jiao Tong University
Hongshen Zhao
Southeast University