🤖 AI Summary
This work addresses the challenge of deploying convolutional neural networks (CNNs) on memory-constrained microcontroller units (MCUs), where high peak RAM usage, driven primarily by intermediate activation tensors during inference, prevents standalone execution. To overcome this limitation, the authors propose a fine-grained collaborative inference system that moves beyond conventional layer-wise model partitioning by decomposing networks at the granularity of individual convolutional kernels or neurons. A lightweight, resource-aware coordinator dynamically schedules computations across heterogeneous MCUs, enabling efficient utilization of distributed resources. The approach successfully deploys previously infeasible models such as MobileNetV2 on platforms comprising up to eight MCUs, substantially reducing peak memory consumption per MCU while maintaining practical end-to-end inference latency.
📝 Abstract
Running deep neural networks on microcontroller units (MCUs) is severely constrained by limited memory resources. While TinyML techniques reduce model size and computation, they often fail in practice due to excessive peak Random Access Memory (RAM) usage during inference, which is dominated by intermediate activations. As a result, many models remain infeasible on standalone MCUs. In this work, we present a fine-grained split inference system for networked MCUs that enables collaborative inference of Convolutional Neural Network (CNN) models across multiple devices. Our key insight is that breaking the memory bottleneck requires splitting inference at sub-layer granularity rather than at layer boundaries. We reinterpret pre-trained models to enable kernel-wise and neuron-wise partitioning, and distribute both model parameters and intermediate activations across multiple MCUs. A lightweight, resource-aware coordinator orchestrates the inference across MCU devices with heterogeneous resources. We implement the proposed system on a real testbed and evaluate it on up to 8 MCUs using MobileNetV2, a representative CNN model. Our experimental results show that CNN models infeasible on a single MCU can be executed across networked MCUs, reducing the per-MCU peak RAM usage while maintaining practical end-to-end inference latency. The source code of this work is available at https://github.com/shashsuresh/split-inference-on-MCUs.
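To make the idea of kernel-wise partitioning concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a 1x1 (pointwise) convolution, gives every device the full input activation (the system described above also distributes activations, which is omitted here), and assigns contiguous kernel ranges in proportion to a per-device RAM budget. The names `split_kernels`, `pointwise_conv`, `split_pointwise_conv`, and `ram_budgets` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def split_kernels(num_kernels: int, ram_budgets: Sequence[int]) -> List[range]:
    """Assign contiguous kernel (output-channel) ranges to devices,
    sized in proportion to each device's RAM budget (an assumed policy)."""
    total = sum(ram_budgets)
    bounds = [0]
    acc = 0
    for budget in ram_budgets:
        acc += budget
        bounds.append(round(num_kernels * acc / total))
    return [range(bounds[i], bounds[i + 1]) for i in range(len(ram_budgets))]


def pointwise_conv(x, weights):
    """1x1 convolution. x is [pixels][c_in]; weights is [c_out][c_in]."""
    return [[sum(w_i * x_i for w_i, x_i in zip(w, px)) for w in weights]
            for px in x]


def split_pointwise_conv(x, weights, ram_budgets):
    """Run each kernel shard separately, as if on its own device,
    then concatenate the per-pixel channel slices in kernel order."""
    parts = []
    for ks in split_kernels(len(weights), ram_budgets):
        shard = weights[ks.start:ks.stop]  # only these kernels reside on this device
        parts.append(pointwise_conv(x, shard))
    return [[v for part in parts for v in part[p]] for p in range(len(x))]


# Two pixels with two input channels; four kernels; two devices with a 1:3 budget ratio.
x = [[1, 2], [3, 4]]
weights = [[1, 0], [0, 1], [1, 1], [2, -1]]
full = pointwise_conv(x, weights)
split = split_pointwise_conv(x, weights, ram_budgets=[1, 3])
```

Because each output channel depends on exactly one kernel, the concatenated result matches the unsplit layer, while each device holds only its shard of the weights and produces only its slice of the output activation.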