🤖 AI Summary
This work addresses the challenges of efficiently performing online fine-tuning of deep neural networks on extreme-edge devices, where computational overhead, severe memory constraints, and the complexity of attention mechanisms pose significant barriers. The authors propose the first end-to-end training framework tailored for ultra-low-power RISC-V heterogeneous SoCs that uniformly supports both CNN and Transformer architectures. The framework integrates several key techniques, including Low-Rank Adaptation (LoRA), selective layer-wise fine-tuning, hardware-accelerated backpropagation, and compressed memory transfers. Evaluated on a Compact Convolutional Transformer (CCT), it achieves an end-to-end fine-tuning throughput of up to 11 images per second at up to 4.6 FLOP/cycle; compared with full backpropagation, LoRA reduces dynamic memory usage by 23%, shrinks the set of trainable parameters and gradients by 15×, and cuts memory traffic by 1.6×. Together, these results demonstrate, for the first time, full online fine-tuning of a complete Transformer model on an extreme-edge SoC.
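To make the 15× reduction in trainable parameters concrete, the sketch below shows a generic LoRA-style linear layer in PyTorch. It is purely illustrative of the technique named above: the class name, rank, and layer sizes are assumptions for the example, not TrainDeeploy's actual RISC-V implementation or API.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen dense layer plus a trainable low-rank update (hypothetical example)."""
    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False  # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(out_features, rank))        # up-projection, zero-init
        self.scale = alpha / rank

    def forward(self, x):
        # y = x W^T + scale * (x A^T) B^T; gradients flow only into A and B
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(128, 128, rank=4)
dense = 128 * 128                                                     # 16,384 weights in the dense layer
lora = sum(p.numel() for p in layer.parameters() if p.requires_grad)  # 2 * 4 * 128 = 1,024
print(f"trainable: {lora} vs {dense} ({dense / lora:.0f}× fewer)")
```

With rank 4 on a 128×128 projection, the trainable parameter count drops by 16×, the same order as the 15× reduction reported for the full model; the ranks actually used in the paper are not stated in this summary.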
📝 Abstract
On-device tuning of deep neural networks enables long-term adaptation at the edge while preserving data privacy. However, the high computational and memory demands of backpropagation pose significant challenges for ultra-low-power, memory-constrained extreme-edge devices. These challenges are further amplified for attention-based models due to their architectural complexity and computational scale. We present TrainDeeploy, a framework that unifies efficient inference and on-device training on heterogeneous ultra-low-power System-on-Chips (SoCs). TrainDeeploy provides the first complete on-device training pipeline for extreme-edge SoCs supporting both Convolutional Neural Networks (CNNs) and Transformer models, together with multiple training strategies such as selective layer-wise fine-tuning and Low-Rank Adaptation (LoRA). On a RISC-V-based heterogeneous SoC, we demonstrate the first end-to-end on-device fine-tuning of a Compact Convolutional Transformer (CCT), achieving up to 11 trained images per second. We show that LoRA reduces dynamic memory usage by 23%, decreases the number of trainable parameters and gradients by 15×, and reduces memory transfer volume by 1.6× compared to full backpropagation. TrainDeeploy achieves up to 4.6 FLOP/cycle on CCT (0.28M parameters, 71–126M FLOPs) and up to 13.4 FLOP/cycle on Deep-AE (0.27M parameters, 0.8M FLOPs), while expanding the scope of prior frameworks to support both CNN and Transformer models with parameter-efficient tuning on extreme-edge platforms.
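The abstract also lists selective layer-wise fine-tuning among the supported training strategies. The snippet below is a rough PyTorch-style illustration of that idea (not TrainDeeploy's code): every parameter is frozen except those in a chosen set of layers, so the backward pass only needs to reach the last unfrozen layer. The prefix names and the stand-in model are hypothetical.

```python
import torch.nn as nn

def freeze_except(model: nn.Module, trainable_prefixes=("head",)):
    """Train only parameters whose names start with one of the given prefixes
    (prefixes here are illustrative; choose them per model architecture)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"fine-tuning {trainable}/{total} parameters ({100 * trainable / total:.1f}%)")

# Example on a small stand-in model: only the final classifier ("head") is trained.
model = nn.Sequential()
model.add_module("backbone", nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)))
model.add_module("head", nn.Linear(64, 10))
freeze_except(model, trainable_prefixes=("head",))
```

Restricting training to the last layers in this way shortens the backward pass and shrinks the gradient and optimizer state that must fit in on-chip memory, which is the same motivation the paper gives for its parameter-efficient strategies.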