🤖 AI Summary
To address the challenges of severe memory constraints, absent native operator support, and immature embedded toolchains on microcontrollers (MCUs), this work presents the first end-to-end deployment of the Mamba state-space model on MCUs. We propose MambaLite-Micro—a lightweight, dependency-free, pure-C inference engine—featuring lightweight weight export, hand-optimized operator fusion, and memory-layout co-design that eliminates large intermediate tensors entirely. Our approach reduces peak memory usage by 83.0% and requires no runtime libraries or dynamic memory allocation. On keyword spotting and human activity recognition tasks, it achieves 100% classification agreement with PyTorch baselines. We validate cross-architecture portability on ESP32-S3 and STM32H7 platforms. This work establishes a practical pathway for deploying state-space models on resource-constrained edge devices.
📝 Abstract
Deploying Mamba models on microcontrollers (MCUs) remains challenging due to limited memory, the lack of native operator support, and the absence of embedded-friendly toolchains. We present, to our knowledge, the first deployment of a Mamba-based neural architecture on resource-constrained MCUs, via MambaLite-Micro, a fully C-based, runtime-free inference engine. Our pipeline maps a trained PyTorch Mamba model to on-device execution by (1) exporting model weights into a lightweight format, and (2) implementing a handcrafted Mamba layer and supporting operators in C with operator fusion and memory-layout optimization. MambaLite-Micro eliminates large intermediate tensors, reducing peak memory by 83.0% while maintaining an average numerical error of only 1.7×10⁻⁵ relative to the PyTorch Mamba implementation. When evaluated on keyword spotting (KWS) and human activity recognition (HAR) tasks, MambaLite-Micro achieved 100% consistency with the PyTorch baselines, fully preserving classification accuracy. We further validated portability by deploying on both ESP32-S3 and STM32H7 microcontrollers, demonstrating consistent operation across heterogeneous embedded platforms and paving the way for bringing advanced sequence models like Mamba to real-world resource-constrained applications.