MambaLite-Micro: Memory-Optimized Mamba Inference on MCUs

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of severe memory constraints, absent native operator support, and immature embedded toolchains on microcontrollers (MCUs), this work presents the first end-to-end deployment of the Mamba state-space model on MCUs. The authors propose MambaLite-Micro, a lightweight, dependency-free, pure-C inference engine featuring a lightweight weight-export format, hand-optimized operator fusion, and memory-layout co-design that eliminates large intermediate tensors entirely. The approach reduces peak memory usage by 83.0% and requires no runtime libraries or dynamic memory allocation. On keyword spotting and human activity recognition tasks, it achieves full accuracy parity (100% output consistency) with PyTorch baselines. Cross-architecture portability is validated on ESP32-S3 and STM32H7 platforms, establishing a practical pathway for deploying state-space models on resource-constrained edge devices.

📝 Abstract
Deploying Mamba models on microcontrollers (MCUs) remains challenging due to limited memory, the lack of native operator support, and the absence of embedded-friendly toolchains. We present, to our knowledge, the first deployment of a Mamba-based neural architecture on a resource-constrained MCU: MambaLite-Micro, a fully C-based, runtime-free inference engine. Our pipeline maps a trained PyTorch Mamba model to on-device execution by (1) exporting model weights into a lightweight format, and (2) implementing a handcrafted Mamba layer and supporting operators in C with operator fusion and memory-layout optimization. MambaLite-Micro eliminates large intermediate tensors, reducing peak memory by 83.0%, while maintaining an average numerical error of only 1.7×10⁻⁵ relative to the PyTorch Mamba implementation. When evaluated on keyword spotting (KWS) and human activity recognition (HAR) tasks, MambaLite-Micro achieved 100% consistency with the PyTorch baselines, fully preserving classification accuracy. We further validated portability by deploying on both ESP32-S3 and STM32H7 microcontrollers, demonstrating consistent operation across heterogeneous embedded platforms and paving the way for bringing advanced sequence models like Mamba to real-world resource-constrained applications.
Problem

Research questions and friction points this paper is trying to address.

Deploying Mamba models on memory-limited microcontrollers efficiently
Lack of native operator support and embedded toolchains for Mamba
Reducing peak memory usage while maintaining model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight model weight export format
Handcrafted C-based Mamba layer implementation
Operator fusion and memory layout optimization
Hongjun Xu (Northwestern University, Evanston, IL)
Junxi Xia (Northwestern University, Evanston, IL)
Weisi Yang (Northwestern University, Evanston, IL)
Yueyuan Sui (Northwestern University, Evanston, IL)
Stephen Xia (Northwestern University)
Embedded Intelligence · Mobile and Embedded Systems · Cyber Physical Systems · Smart Environments