🤖 AI Summary
To address resource constraints, low automation, and poor task generalizability in deploying Transformers on embedded FPGAs, this paper introduces the first fully automated Tiny Transformer deployment framework tailored for time-series analysis—including forecasting, classification, and anomaly detection. Methodologically, it integrates 4-bit quantization-aware training, Optuna-driven hardware-aware hyperparameter search, and automatic VHDL code generation. This enables, for the first time, integer-only, task-specialized encoder-only accelerators on lightweight FPGAs such as Xilinx Spartan-7 and Lattice iCE40. Experimental results demonstrate 0.033 mJ/inference energy efficiency and millisecond-scale latency on AMD Spartan-7. The framework is validated across six public time-series benchmarks and two embedded FPGA platforms. All source code is publicly released.
📝 Abstract
Transformer-based models have shown strong performance across diverse time-series tasks, but their deployment on resource-constrained devices remains challenging due to high memory and computational demand. While prior work targeting Microcontroller Units (MCUs) has explored hardware-specific optimizations, such approaches are often task-specific and limited to 8-bit fixed-point precision. Field-Programmable Gate Arrays (FPGAs) offer greater flexibility, enabling fine-grained control over data precision and architecture. However, existing FPGA-based deployments of Transformers for time-series analysis typically focus on high-density platforms with manual configuration. This paper presents a unified and fully automated deployment framework for Tiny Transformers on embedded FPGAs. Our framework supports a compact encoder-only Transformer architecture across three representative time-series tasks (forecasting, classification, and anomaly detection). It combines quantization-aware training (down to 4 bits), hardware-aware hyperparameter search using Optuna, and automatic VHDL generation for seamless deployment. We evaluate our framework on six public datasets across two embedded FPGA platforms. Results show that our framework produces integer-only, task-specific Transformer accelerators achieving as low as 0.033 mJ per inference with millisecond latency on AMD Spartan-7, while also providing insights into deployment feasibility on Lattice iCE40. All source code will be released in the GitHub repository (https://github.com/Edwina1030/TinyTransformer4TS).