🤖 AI Summary
To address the challenges of severe memory constraints, absent native operator support, and immature embedded toolchains on microcontrollers (MCUs), this work presents the first end-to-end deployment of the Mamba state-space model on MCUs. We propose MambaLite-Micro—a lightweight, dependency-free, pure-C inference engine—featuring lightweight weight export, hand-optimized operator fusion, and memory-layout co-design that eliminates large intermediate tensors entirely. Our approach reduces peak memory usage by 83.0% and requires no runtime libraries or dynamic memory allocation. On keyword spotting and human activity recognition tasks, it achieves 100% classification agreement with PyTorch baselines. We validate cross-architecture portability on ESP32-S3 and STM32H7 platforms. This work establishes a practical pathway for deploying state-space models on resource-constrained edge devices.
📝 Abstract
Deploying Mamba models on microcontrollers (MCUs) remains challenging due to limited memory, the lack of native operator support, and the absence of embedded-friendly toolchains. We present, to our knowledge, the first deployment of a Mamba-based neural architecture on resource-constrained MCUs, via MambaLite-Micro, a fully C-based, runtime-free inference engine. Our pipeline maps a trained PyTorch Mamba model to on-device execution by (1) exporting model weights into a lightweight format, and (2) implementing a handcrafted Mamba layer and supporting operators in C with operator fusion and memory-layout optimization. MambaLite-Micro eliminates large intermediate tensors, reducing peak memory by 83.0% while maintaining an average numerical error of only 1.7×10⁻⁵ relative to the PyTorch Mamba implementation. When evaluated on keyword spotting (KWS) and human activity recognition (HAR) tasks, MambaLite-Micro achieved 100% consistency with the PyTorch baselines, fully preserving classification accuracy. We further validated portability by deploying on both ESP32-S3 and STM32H7 microcontrollers, demonstrating consistent operation across heterogeneous embedded platforms and paving the way for bringing advanced sequence models like Mamba to real-world resource-constrained applications.