🤖 AI Summary
Vision Mamba suffers from poor parallelism, memory bandwidth bottlenecks, and low GPU utilization on edge devices due to its sequential scan operation. To address these challenges, this work proposes an end-to-end hardware–software co-optimization framework. We design a dedicated systolic scan array to accelerate the state-space model’s sequential scanning path in hardware, enabling fine-grained parallelism. Additionally, we introduce a hardware-friendly hybrid quantization scheme (FP16/INT8 co-quantization) that compresses weights and intermediate activations without any accuracy loss. Experimental results demonstrate that our approach reduces inference latency by 42% and memory footprint by 57% compared to standard Transformer-based inference. Consequently, it significantly improves throughput and energy efficiency on edge AI chips. This work establishes a scalable hardware acceleration paradigm for Mamba-style models in resource-constrained deployment scenarios.
📝 Abstract
Transformers have proven effective in language modeling but are limited by high computational and memory demands that grow quadratically with input sequence length. State space models (SSMs) offer a promising alternative by reducing attention complexity from $O(L^2)$ to $O(L)$ while also lowering overall memory consumption. Vision Mamba adapts the SSM approach for computer vision tasks, achieving lower latency and memory consumption than traditional transformer models. However, deploying Vision Mamba on edge devices is challenging due to its sequential scan operations, which hinder GPU efficiency. We propose Mamba-X, an end-to-end Vision Mamba accelerator that includes a systolic scan array to maximize parallelism and minimize memory traffic, along with a hybrid, hardware-friendly quantization technique to reduce memory usage and improve hardware efficiency without sacrificing accuracy.