🤖 AI Summary
Deploying Mamba2 on edge FPGAs faces three key challenges: (1) outlier values in linear layers degrading quantization accuracy, (2) hardware-unfriendly nonlinear operations (e.g., exp/sigmoid) in State Space Model (SSM) blocks, and (3) irregular tensor access patterns hindering efficient hardware mapping.
Method: We propose an algorithm–hardware co-design framework: (1) a Hadamard transform-based preprocessing to suppress linear-layer outliers, enabling stable 8-bit quantization; (2) a fine-grained power-of-two quantization scheme with first-order linear approximations replacing exp/sigmoid in SSMs; and (3) a pipelined, parallel vector-processing architecture tailored for SSM computation.
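The outlier-suppression idea in step (1) can be illustrated with a small sketch: an orthonormal Hadamard transform spreads the energy of a single large outlier across all coordinates, which shrinks the max-abs quantization scale and reduces 8-bit rounding error. This is a minimal numpy illustration of the principle, not the paper's exact preprocessing; the Sylvester construction, the injected outlier, and the int8 scheme here are my assumptions.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

def quantize_int8(x):
    # Symmetric max-abs int8 quantization.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q, scale

rng = np.random.default_rng(0)
n = 64
x = rng.normal(size=n)
x[3] = 40.0  # inject one severe outlier (hypothetical)

H = hadamard(n)
xh = H @ x  # outlier energy is now spread over all 64 coordinates

q_plain, s_plain = quantize_int8(x)
q_had, s_had = quantize_int8(xh)

# Mean reconstruction error with and without the transform;
# the inverse transform H.T maps the dequantized vector back.
err_plain = np.abs(q_plain * s_plain - x).mean()
err_had = np.abs(H.T @ (q_had * s_had) - x).mean()
```

Because the transform is orthonormal, it can be folded into the adjacent linear layers' weights, so the dequantized result is recovered at no extra runtime cost; the sketch only demonstrates the error reduction.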
Results: On a Xilinx VC709 FPGA, our implementation achieves 68.8× and 8.9× speedup over an Intel Xeon CPU and NVIDIA RTX 3090 GPU, respectively, for Mamba2-130M prefill; for Mamba2-2.7B decoding, it delivers 6× higher energy efficiency than the RTX 3090.
📝 Abstract
State Space Models (SSMs), such as the recent Mamba2, have achieved remarkable performance and received extensive attention. However, deploying Mamba2 on resource-constrained edge devices raises several problems: severe outliers in the linear layers that challenge quantization, diverse and irregular element-wise tensor operations, and hardware-unfriendly nonlinear functions in the SSM block. To address these issues, this paper presents FastMamba, a dedicated FPGA accelerator with hardware-algorithm co-design that improves the deployment efficiency of Mamba2. Specifically, we achieve 8-bit quantization for the linear layers through a Hadamard transformation that eliminates outliers. Moreover, a hardware-friendly, fine-grained power-of-two quantization framework is presented for the SSM block and convolution layer, and a first-order linear approximation is developed to optimize the nonlinear functions. Building on this accurate quantization, we propose an accelerator that integrates parallel vector processing units, a pipelined execution dataflow, and an efficient SSM Nonlinear Approximation Unit, which together enhance computational efficiency and reduce hardware complexity. Finally, we evaluate FastMamba on a Xilinx VC709 FPGA. For the input prefill task on Mamba2-130M, FastMamba achieves 68.80× and 8.90× speedups over an Intel Xeon 4210R CPU and an NVIDIA RTX 3090 GPU, respectively. In the output decode experiment with Mamba2-2.7B, FastMamba attains 6× higher energy efficiency than the RTX 3090 GPU.
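The two SSM-side techniques in the abstract, power-of-two quantization and first-order (piecewise-linear) approximation of nonlinear functions, can also be sketched. The snippet below is a hedged illustration, not the paper's scheme: the bit width, the per-tensor granularity, and the choice of approximating exp over [-8, 0] with 16 uniform segments are my assumptions. A power-of-two scale means dequantization reduces to a bit shift in hardware, and a piecewise-linear exp needs only a small lookup table plus one multiply-add.

```python
import numpy as np

def po2_quantize(x, bits=8):
    """Symmetric quantization with a power-of-two scale.

    Rounding the scale up to a power of two keeps the quantized
    range valid while letting hardware dequantize via bit shifts."""
    qmax = 2 ** (bits - 1) - 1
    shift = int(np.ceil(np.log2(np.abs(x).max() / qmax)))
    s = 2.0 ** shift
    q = np.clip(np.round(x / s), -qmax, qmax)
    return q, s

def exp_pwl(x):
    """First-order (piecewise-linear) approximation of exp on [-8, 0],
    a plausible range for negative decay exponents in an SSM recurrence
    (the interval and segment count are illustrative assumptions)."""
    knots = np.linspace(-8.0, 0.0, 17)  # 16 uniform segments
    idx = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, 15)
    x0, x1 = knots[idx], knots[idx + 1]
    y0, y1 = np.exp(x0), np.exp(x1)     # table entries, precomputed once
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Worst-case error of the linear approximation over the covered range.
xs = np.linspace(-8.0, 0.0, 1001)
max_err = np.abs(exp_pwl(xs) - np.exp(xs)).max()

# Power-of-two dequantization error on a random tensor.
w = np.random.default_rng(1).normal(size=256)
q, s = po2_quantize(w)
deq_err = np.abs(q * s - w).max()
```

For uniform segments of width h, linear interpolation of exp has worst-case error at most h²·max|exp''|/8 = 0.25·1/8 ≈ 0.031 on this interval, which is why a handful of table entries suffices; finer segmentation (or non-uniform knots) trades table size for accuracy.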