eMamba: Efficient Acceleration Framework for Mamba Models in Edge Computing

📅 2025-08-14
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the absence of hardware-acceleration frameworks tailored for Mamba models on resource-constrained edge devices, this paper proposes the first end-to-end hardware-aware acceleration framework for Mamba. The method introduces three core innovations: (1) a hardware-friendly lightweight normalization layer; (2) SiLU- and exponential-based approximate computation to reduce arithmetic complexity; and (3) approximation-aware neural architecture search (NAS) that jointly optimizes accuracy and efficiency. The framework supports both FPGA and ASIC implementations and is compatible with multimodal tasks. Experiments across multiple benchmarks demonstrate significant improvements: 1.63–19.9× parameter compression, 4.95–5.62× latency reduction, and up to 48.6× energy-efficiency gain, all while preserving high model accuracy.

πŸ“ Abstract
State Space Model (SSM)-based machine learning architectures have recently gained significant attention for processing sequential data. Mamba, a recent sequence-to-sequence SSM, offers competitive accuracy with superior computational efficiency compared to state-of-the-art transformer models. While this advantage makes Mamba particularly promising for resource-constrained edge devices, no hardware acceleration frameworks are currently optimized for deploying it in such environments. This paper presents eMamba, a comprehensive end-to-end hardware acceleration framework explicitly designed for deploying Mamba models on edge platforms. eMamba maximizes computational efficiency by replacing complex normalization layers with lightweight hardware-aware alternatives and approximating expensive operations, such as SiLU activation and exponentiation, considering the target applications. Then, it performs an approximation-aware neural architecture search (NAS) to tune the learnable parameters used during approximation. Evaluations with Fashion-MNIST, CIFAR-10, and MARS, an open-source human pose estimation dataset, show eMamba achieves comparable accuracy to state-of-the-art techniques using 1.63–19.9× fewer parameters. In addition, it generalizes well to large-scale natural language tasks, demonstrating stable perplexity across varying sequence lengths on the WikiText2 dataset. We also quantize and implement the entire eMamba pipeline on an AMD ZCU102 FPGA and ASIC using GlobalFoundries (GF) 22 nm technology. Experimental results show 4.95–5.62× lower latency and 2.22–9.95× higher throughput, with 4.77× smaller area, 9.84× lower power, and 48.6× lower energy consumption than baseline solutions while maintaining competitive accuracy.
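The abstract's point about swapping complex normalization for lightweight hardware-aware alternatives can be illustrated with a common substitution of this kind. The sketch below is not eMamba's actual layer; it uses RMSNorm purely as an example of how dropping mean-centering removes a full pass over the data:

```python
import math

def layer_norm(x, eps=1e-5):
    """Standard LayerNorm: mean-center, then divide by the standard
    deviation. Needs two passes over the vector plus a sqrt and divide."""
    n = len(x)
    mean = sum(v for v in x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """RMSNorm-style lightweight alternative (illustrative, not the
    paper's layer): skip mean-centering and scale by the root-mean-square
    only. One pass and fewer ops, which maps more cheaply to fixed-point
    hardware."""
    n = len(x)
    rms = math.sqrt(sum(v * v for v in x) / n + eps)
    return [v / rms for v in x]

x = [0.5, -1.2, 2.0, 0.1]
print([round(v, 3) for v in rms_norm(x)])
```

The output vector has unit root-mean-square by construction, which is the invariant the cheaper layer preserves.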
Problem

Research questions and friction points this paper is trying to address.

Optimizing Mamba models for edge computing efficiency
Developing hardware acceleration for resource-constrained edge devices
Reducing computational complexity while maintaining model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replacement of complex normalization layers with lightweight hardware-aware alternatives
Approximation-aware neural architecture search (NAS)
Efficient FPGA and ASIC implementation with quantization
Jiyong Kim
University of Ulsan, Republic of Korea

Jaeho Lee
University of Ulsan, Republic of Korea

Jiahao Lin
University of Wisconsin-Madison, USA

Alish Kanani
University of Wisconsin-Madison
Chiplets, Thermal management, Performance Modeling, Task Scheduling, Approximate Circuits

Miao Sun
WeRide
Computer Vision, Autonomous Driving

Umit Y. Ogras
University of Wisconsin-Madison, USA

Jaehyun Park
University of Ulsan, Republic of Korea