AI Summary
To address the absence of hardware-acceleration frameworks tailored for Mamba models on resource-constrained edge devices, this paper proposes the first end-to-end hardware-aware acceleration framework for Mamba. The method introduces three core innovations: (1) a hardware-friendly lightweight normalization layer; (2) approximate computation of expensive operations (the SiLU activation and exponentiation) to reduce arithmetic complexity; and (3) an approximation-aware neural architecture search (NAS) that jointly optimizes accuracy and efficiency. The framework supports both FPGA and ASIC implementations and is compatible with multimodal tasks. Experiments across multiple benchmarks demonstrate significant improvements: 1.63×–19.9× parameter compression, 4.95×–5.62× latency reduction, and up to 48.6× lower energy consumption, all while preserving high model accuracy.
Abstract
State Space Model (SSM)-based machine learning architectures have recently gained significant attention for processing sequential data. Mamba, a recent sequence-to-sequence SSM, offers competitive accuracy with superior computational efficiency compared to state-of-the-art transformer models. While this advantage makes Mamba particularly promising for resource-constrained edge devices, no hardware acceleration frameworks are currently optimized for deploying it in such environments. This paper presents eMamba, a comprehensive end-to-end hardware acceleration framework explicitly designed for deploying Mamba models on edge platforms. eMamba maximizes computational efficiency by replacing complex normalization layers with lightweight hardware-aware alternatives and approximating expensive operations, such as SiLU activation and exponentiation, considering the target applications. Then, it performs an approximation-aware neural architecture search (NAS) to tune the learnable parameters used during approximation. Evaluations with Fashion-MNIST, CIFAR-10, and MARS, an open-source human pose estimation dataset, show eMamba achieves comparable accuracy to state-of-the-art techniques using 1.63-19.9$\times$ fewer parameters. In addition, it generalizes well to large-scale natural language tasks, demonstrating stable perplexity across varying sequence lengths on the WikiText2 dataset. We also quantize and implement the entire eMamba pipeline on an AMD ZCU102 FPGA and ASIC using GlobalFoundries (GF) 22 nm technology. Experimental results show 4.95-5.62$\times$ lower latency and 2.22-9.95$\times$ higher throughput, with 4.77$\times$ smaller area, 9.84$\times$ lower power, and 48.6$\times$ lower energy consumption than baseline solutions while maintaining competitive accuracy.
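The abstract does not spell out the approximation scheme used for the expensive SiLU and exponential operations, but the general idea of replacing them with hardware-friendly surrogates can be sketched as follows. This is a minimal illustration using a piecewise-linear lookup over a fixed input range, which maps naturally onto FPGA/ASIC logic; the input range, number of segments, and saturation behavior below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def silu(x):
    # Exact SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def silu_pwl(x, xmin=-8.0, xmax=8.0, segments=32):
    # Piecewise-linear SiLU over [xmin, xmax]: the breakpoint table
    # can be precomputed once and stored on-chip, so evaluation needs
    # only a table lookup plus one multiply-add per input.
    xs = np.linspace(xmin, xmax, segments + 1)
    ys = silu(xs)
    # Saturate outside the table: SiLU(x) -> 0 for x << 0 and -> x for x >> 0.
    return np.where(x < xmin, 0.0,
           np.where(x > xmax, x, np.interp(x, xs, ys)))

x = np.linspace(-10.0, 10.0, 2001)
err = np.max(np.abs(silu(x) - silu_pwl(x)))
```

An approximation-aware NAS, as described in the summary, would treat parameters like `segments` and the clipping range as searchable knobs, trading approximation error against hardware cost.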