Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing

📅 2025-02-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Vision Mamba models suffer from low inference efficiency when processing ultra-high-resolution images (e.g., 2048×2048), where the token sequence becomes very long. Method: This paper proposes FastVim, an architecture that pools tokens along an image dimension and alternates the pooled dimension across Mamba blocks. This halves the number of parallel scan steps in the state space model (SSM) block while retaining model performance. Contribution/Results: Experiments report state-of-the-art performance on image classification, semantic segmentation, object detection, and cell perturbation prediction, with up to a 72.5% inference speedup over baseline Vision Mamba models on 2048×2048 images.
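The alternating pooling idea can be illustrated with a minimal sketch. It assumes mean pooling that collapses one axis of the token grid; the summary does not specify the pooling operator, so that choice is an assumption. Scalar tokens stand in for embedding vectors, and the names `pool_tokens` and `pooled_sequences` are hypothetical and are not taken from the FastVim code. The sketch omits the SSM itself and any step that restores the full grid after a block.

```python
def pool_tokens(grid, axis):
    """Mean-pool a 2D token grid along one axis (assumed pooling operator).

    axis=0 averages over rows and returns one token per column;
    axis=1 averages over columns and returns one token per row.
    """
    if axis == 0:
        return [sum(col) / len(col) for col in zip(*grid)]
    return [sum(row) / len(row) for row in grid]


def pooled_sequences(grid, num_blocks):
    """Alternate the pooled axis from block to block (even: rows, odd: columns)."""
    return [pool_tokens(grid, axis=block % 2) for block in range(num_blocks)]


# A 2x3 grid of scalar "tokens" (real tokens would be embedding vectors).
grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
seqs = pooled_sequences(grid, num_blocks=2)
print(seqs[0])  # pooled over rows: one token per column
print(seqs[1])  # pooled over columns: one token per row
```

In this toy setup each block scans a sequence whose length is one side of the grid rather than the full number of tokens, which is where the reduction in scan steps comes from.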

📝 Abstract

State Space Models (SSMs) with selective scan (Mamba) have been adapted into efficient vision models. Mamba, unlike Vision Transformers, achieves linear complexity for token interactions through a recurrent hidden state process. This sequential processing is enhanced by a parallel scan algorithm, which reduces the computational time of recurrent steps from $L$ sequential steps to $\log(L)$ parallel steps with respect to the number of input tokens ($L$). In this work, we propose Fast Vision Mamba (FastVim), which further reduces the computational time of the SSM block by reducing the number of recurrent steps in Vision Mamba models while still retaining model performance. By alternately pooling tokens along image dimensions across Mamba blocks, we obtain a 2$\times$ reduction in the number of parallel steps in the SSM block. Our model offers up to $72.5\%$ speedup in inference speed compared to baseline Vision Mamba models on high-resolution (2048$\times$2048) images. Our experiments demonstrate state-of-the-art performance with dramatically improved throughput in a range of tasks such as image classification, cell perturbation prediction, segmentation, and object detection. Code is made available at https://github.com/insitro/FastVim
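The 2$\times$ figure can be checked with simple arithmetic, sketched below under stated assumptions. A parallel scan over $L$ tokens needs about $\log_2(L)$ steps. If pooling collapses one side of a square $N \times N$ token grid, the scanned sequence shrinks from $N^2$ tokens to $N$, so the step count drops from $2\log_2(N)$ to $\log_2(N)$. The patch size of 16 and the helper name `parallel_scan_steps` are illustrative assumptions, not values taken from the paper.

```python
def parallel_scan_steps(num_tokens: int) -> int:
    """Depth of a parallel scan over num_tokens items, i.e. ceil(log2(num_tokens))."""
    return (num_tokens - 1).bit_length() if num_tokens > 1 else 0


image_size = 2048   # resolution quoted in the abstract
patch_size = 16     # assumed patch size, for illustration only
side = image_size // patch_size            # tokens per grid side

full_steps = parallel_scan_steps(side * side)  # scan over the whole grid
pooled_steps = parallel_scan_steps(side)       # scan after pooling one dimension

print(side, full_steps, pooled_steps)
```

Halving the number of scan steps does not by itself imply a 2$\times$ end-to-end speedup, since the SSM block is only one part of each layer. The abstract reports up to $72.5\%$ for the full model.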

Problem

Research questions and friction points this paper is trying to address.

Efficient Image Processing

Ultra-large Image Handling

Accelerated Computer Vision Tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

FastVim

Image Processing Speed

Multitask Performance

🔎 Similar Papers

MambaVision: A Hybrid Mamba-Transformer Vision Backbone