🤖 AI Summary
To address the energy-efficiency and hardware-compatibility challenges of deploying Transformer models in tinyML settings, this paper proposes a heterogeneous acceleration architecture and an end-to-end automated deployment methodology tailored to 8-bit quantized attention inference. The method integrates a RISC-V octa-core cluster with a custom attention accelerator, co-designed under stringent tinyML power constraints. Key contributions include: (1) the first heterogeneous architecture enabling 8-bit quantized Transformer inference within typical tinyML power budgets; and (2) a scalable heterogeneous template implemented in 22 nm FD-SOI technology, coupled with a compiler and deployment co-optimization flow. Experimental results show that the system achieves 2960 GOp/J energy efficiency and 154 GOp/s throughput at 0.65 V, setting a new state of the art in energy efficiency for Transformer inference in the tinyML domain.
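As a quick sanity check, derived from the two reported figures rather than stated in the source, dividing throughput by energy efficiency gives the implied power draw at the 0.65 V operating point, which lands comfortably inside a typical tinyML budget of tens of milliwatts:

```latex
P = \frac{\text{throughput}}{\text{energy efficiency}}
  = \frac{154\ \text{GOp/s}}{2960\ \text{GOp/J}}
  \approx 0.052\ \text{W} = 52\ \text{mW}
```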
📝 Abstract
One of the challenges for Tiny Machine Learning (tinyML) is keeping up with the evolution of Machine Learning models from Convolutional Neural Networks to Transformers. We address this by leveraging a heterogeneous architectural template that couples RISC-V processors with hardwired accelerators, supported by an automated deployment flow. We demonstrate Attention-based models within a tinyML power envelope using an octa-core cluster coupled with an accelerator for quantized Attention. Our deployment flow enables end-to-end 8-bit Transformer inference, achieving leading-edge energy efficiency and throughput of 2960 GOp/J and 154 GOp/s (0.65 V, 22 nm FD-SOI technology).
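To make "8-bit quantized Attention" concrete, here is a minimal NumPy sketch of generic int8 attention with int32 accumulation. It is not the paper's accelerator datapath: the per-tensor scales, the rounding scheme, and especially the float softmax are illustrative assumptions (a dedicated accelerator would use a fixed-point softmax approximation so the kernel stays integer-only).

```python
import numpy as np

def quantize(x, scale):
    """Symmetric int8 quantization: round(x / scale), clipped to [-128, 127]."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def int8_attention(q, k, v, s_q, s_k, s_v, s_out, d):
    """Single-head attention on int8 inputs with int32 accumulation.

    q, k, v : int8 arrays of shape [seq, d]
    s_q, s_k, s_v, s_out : per-tensor scales (illustrative assumptions)
    """
    # Integer matmul with 32-bit accumulators, as a MAC-array datapath would do.
    logits_i32 = q.astype(np.int32) @ k.astype(np.int32).T
    # Dequantize logits and apply the 1/sqrt(d) scaling; a real integer datapath
    # would fold this into a fixed-point softmax instead of going through float.
    logits = logits_i32 * (s_q * s_k) / np.sqrt(d)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Requantize probabilities to int8 so the second matmul is also integer.
    s_p = 1.0 / 127.0
    probs_i8 = quantize(probs, s_p)
    out_i32 = probs_i8.astype(np.int32) @ v.astype(np.int32)
    # Final requantization back to int8 activations.
    return quantize(out_i32 * (s_p * s_v), s_out)

rng = np.random.default_rng(0)
seq, d = 16, 64
q = rng.integers(-128, 128, (seq, d), dtype=np.int8)
k = rng.integers(-128, 128, (seq, d), dtype=np.int8)
v = rng.integers(-128, 128, (seq, d), dtype=np.int8)
y = int8_attention(q, k, v, s_q=0.02, s_k=0.02, s_v=0.02, s_out=0.05, d=d)
print(y.shape, y.dtype)  # (16, 64) int8
```

Storing activations as int8 while accumulating dot products in int32 is the standard trade-off that keeps an accelerator's datapath and memory traffic narrow without overflowing on long reductions.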