Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain

📅 2024-05-29
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
To address the loss of 2D structural information due to image flattening and the difficulty in jointly modeling local and global contexts in Vision Mamba (ViM), this paper proposes a frequency-spatial joint modeling architecture. Methodologically, it introduces three key innovations: (1) a novel frequency-domain enhancement mechanism leveraging the Fast Fourier Transform (FFT) to inject frequency-domain priors, enabling synergistic modeling with spatial-domain Mamba; (2) elimination of positional encoding to fully unleash the long-range dependency modeling capacity of state-space models; and (3) replacement of patch embedding with a convolutional stem to preserve local structure and improve feature initialization quality. Evaluated on benchmarks including ImageNet, the proposed method achieves accuracy comparable to Vision Transformers (ViTs) while significantly outperforming the original ViM, demonstrating superior performance at low computational cost. These results validate the effectiveness and generalizability of dual-domain (frequency + spatial) modeling for visual state-space models.
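The frequency-domain enhancement described above can be sketched as follows. This is a minimal illustration, assuming the spectrum is fused by adding the (shifted, log-scaled) FFT magnitude back onto the spatial feature map; the exact spectrum component, normalization, and any learned fusion weights used by Vim-F are assumptions here, not the paper's implementation.

```python
import numpy as np

def frequency_enhance(feature_map):
    """Hedged sketch: inject an FFT-based frequency prior into a feature map.

    feature_map: array of shape (C, H, W). The fusion details (log-magnitude,
    fftshift, plain addition) are illustrative assumptions.
    """
    # 2D FFT over the spatial dimensions of each channel
    spectrum = np.fft.fft2(feature_map, axes=(-2, -1))
    # Log-magnitude keeps the spectrum on a scale comparable to the features
    magnitude = np.log1p(np.abs(spectrum))
    # Center the zero-frequency component, as is conventional for spectra
    magnitude = np.fft.fftshift(magnitude, axes=(-2, -1))
    # Add the frequency-domain prior to the original spatial features,
    # so a subsequent 1D scan sees globally mixed information
    return feature_map + magnitude
```

Because every spatial position contributes to every frequency bin, each token in the enhanced map carries global information, which is the stated motivation for giving ViM a global receptive field during scanning.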

๐Ÿ“ Abstract
In recent years, State Space Models (SSMs) with efficient hardware-aware designs, known as the Mamba deep learning models, have made significant progress in modeling long sequences such as language understanding. Therefore, building efficient and general-purpose visual backbones based on SSMs is a promising direction. Compared to traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs), the performance of Vision Mamba (ViM) methods is not yet fully competitive. To enable SSMs to process image data, ViMs typically flatten 2D images into 1D sequences, inevitably ignoring some 2D local dependencies, thereby weakening the model's ability to interpret spatial relationships from a global perspective. We use the Fast Fourier Transform (FFT) to obtain the spectrum of the feature map and add it to the original feature map, enabling ViM to model a unified visual representation in both frequency and spatial domains. The introduction of frequency domain information enables ViM to have a global receptive field during scanning. We propose a novel model called Vim-F, which employs pure Mamba encoders and scans in both the frequency and spatial domains. Moreover, we question the necessity of position embedding in ViM and remove it accordingly in Vim-F, which helps to fully utilize the efficient long-sequence modeling capability of ViM. Finally, we redesign a patch embedding for Vim-F, leveraging a convolutional stem to capture more local correlations, further improving the performance of Vim-F. Code is available at: https://github.com/yws-wxs/Vim-F
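The abstract's last point, replacing the single large-kernel patchify projection with a convolutional stem, can be sketched as below. This is a hedged illustration only: the kernel sizes, strides, channel widths, and activation are assumptions, not Vim-F's published configuration.

```python
import numpy as np

def conv2d(x, w, stride=1):
    """Naive valid-padding 2D convolution.

    x: input of shape (C_in, H, W); w: kernel of shape (C_out, C_in, k, k).
    """
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    oh = (h - k) // stride + 1
    ow = (wd - k) // stride + 1
    out = np.zeros((c_out, oh, ow), dtype=np.float64)
    for i in range(oh):
        for j in range(ow):
            patch = x[:, i * stride:i * stride + k, j * stride:j * stride + k]
            # Contract over (C_in, k, k) to get one value per output channel
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def conv_stem(img, weights):
    """Hedged sketch: a stack of small strided convs (conv + ReLU) standing in
    for the usual 16x16 patchify projection, so overlapping receptive fields
    preserve more local structure before tokens enter the Mamba encoder."""
    x = img
    for w in weights:
        x = np.maximum(conv2d(x, w, stride=2), 0.0)  # conv followed by ReLU
    return x
```

Compared with a single non-overlapping patch projection, the overlapping 3x3 windows let neighboring tokens share local evidence, which is the intuition behind "capturing more local correlations" in the abstract.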
Problem

Research questions and friction points this paper is trying to address.

Mamba model
image processing
performance enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vim-F Model
Frequency Domain Learning
FFT-based Feature Extraction
Juntao Zhang
Henan University
Kun Bian
School of Electronic Engineering, Xidian University
Peng Cheng
Wenbo An
AMS, Beijing, China
Jianning Liu
Jun Zhou
AMS, Beijing, China