AI Summary
To address the absence of hardware-acceleration frameworks tailored for Mamba models on resource-constrained edge devices, this paper proposes the first end-to-end hardware-aware acceleration framework for Mamba. The method introduces three core innovations: (1) a hardware-friendly lightweight normalization layer; (2) approximate computation of expensive operations (the SiLU activation and exponentiation) to reduce arithmetic complexity; and (3) an approximation-aware neural architecture search (NAS) that jointly optimizes accuracy and efficiency. The framework supports both FPGA and ASIC implementations and is compatible with multimodal tasks. Experiments across multiple benchmarks demonstrate significant improvements: 1.63×–19.9× parameter compression, 4.95×–5.62× latency reduction, and up to 48.6× lower energy consumption, all while preserving high model accuracy.
Abstract
State Space Model (SSM)-based machine learning architectures have recently gained significant attention for processing sequential data. Mamba, a recent sequence-to-sequence SSM, offers competitive accuracy with superior computational efficiency compared to state-of-the-art transformer models. While this advantage makes Mamba particularly promising for resource-constrained edge devices, no hardware acceleration frameworks are currently optimized for deploying it in such environments. This paper presents eMamba, a comprehensive end-to-end hardware acceleration framework explicitly designed for deploying Mamba models on edge platforms. eMamba maximizes computational efficiency by replacing complex normalization layers with lightweight hardware-aware alternatives and approximating expensive operations, such as SiLU activation and exponentiation, considering the target applications. Then, it performs an approximation-aware neural architecture search (NAS) to tune the learnable parameters used during approximation. Evaluations with Fashion-MNIST, CIFAR-10, and MARS, an open-source human pose estimation dataset, show eMamba achieves comparable accuracy to state-of-the-art techniques using 1.63-19.9$\times$ fewer parameters. In addition, it generalizes well to large-scale natural language tasks, demonstrating stable perplexity across varying sequence lengths on the WikiText2 dataset. We also quantize and implement the entire eMamba pipeline on an AMD ZCU102 FPGA and ASIC using GlobalFoundries (GF) 22 nm technology. Experimental results show 4.95-5.62$\times$ lower latency and 2.22-9.95$\times$ higher throughput, with 4.77$\times$ smaller area, 9.84$\times$ lower power, and 48.6$\times$ lower energy consumption than baseline solutions while maintaining competitive accuracy.
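The abstract does not spell out the approximation scheme used for the expensive SiLU and exponential operations, but the general idea of replacing them with hardware-friendly surrogates can be sketched as follows. This is a minimal illustration using a piecewise-linear lookup over a fixed input range, which maps naturally onto FPGA/ASIC logic; the input range, number of segments, and saturation behavior below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def silu(x):
    # Exact SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def silu_pwl(x, xmin=-8.0, xmax=8.0, segments=32):
    # Piecewise-linear SiLU over [xmin, xmax]: the breakpoint table
    # can be precomputed once and stored on-chip, so evaluation needs
    # only a table lookup plus one multiply-add per input.
    xs = np.linspace(xmin, xmax, segments + 1)
    ys = silu(xs)
    # Saturate outside the table: SiLU(x) -> 0 for x << 0 and -> x for x >> 0.
    return np.where(x < xmin, 0.0,
           np.where(x > xmax, x, np.interp(x, xs, ys)))

x = np.linspace(-10.0, 10.0, 2001)
err = np.max(np.abs(silu(x) - silu_pwl(x)))
```

An approximation-aware NAS, as described in the summary, would treat parameters like `segments` and the clipping range as searchable knobs, trading approximation error against hardware cost.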