AI Summary
To address time-series forecasting demands on resource-constrained AIoT edge devices, this work proposes a hardware-accelerated integer-quantized Transformer tailored for embedded FPGAs. Methodologically, it integrates quantization-aware training (QAT) with 4-/6-bit integer quantization and designs a customized low-bit hardware architecture; notably, it achieves the first FPGA implementation of a 4-bit integer-quantized Transformer on a Spartan-7 device. Through joint optimization of resource utilization, power consumption, and timing, it overcomes the trade-off bottleneck among accuracy, energy efficiency, and latency at ultra-low bitwidths. Experiments show that, relative to an 8-bit quantized baseline from related work, the 4-bit model increases test loss by only 0.63%, delivers up to 132.33× higher inference throughput, and reduces energy consumption by 48.19×, revealing a non-monotonic relationship between quantization bitwidth and energy/latency. This work validates the feasibility of deploying Transformers on ultra-low-power IoT endpoints and establishes a reproducible software-hardware co-design paradigm for edge time-series intelligence.
Abstract
This paper presents the design of a hardware accelerator for Transformers, optimized for on-device time-series forecasting in AIoT systems. It integrates integer-only quantization and Quantization-Aware Training with optimized hardware designs to realize 6-bit and 4-bit quantized Transformer models, which achieve precision comparable to 8-bit quantized models from related research. Using a complete implementation on an embedded FPGA (Xilinx Spartan-7 XC7S15), we examine the feasibility of deploying Transformer models on embedded IoT devices, including a thorough analysis of achievable precision, resource utilization, timing, power, and energy consumption for on-device inference. Our results indicate that while sufficient performance can be attained, the optimization process is not trivial. For instance, reducing the quantization bitwidth does not consistently reduce latency or energy consumption, underscoring the necessity of systematically exploring various optimization combinations. Compared to an 8-bit quantized Transformer model in related studies, our 4-bit quantized Transformer model increases test loss by only 0.63%, operates up to 132.33× faster, and consumes 48.19× less energy.
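To make the quantization scheme concrete, the following is a minimal sketch of symmetric per-tensor integer quantization of the kind typically used in QAT forward passes. It is an illustrative assumption, not the paper's exact scheme: the function name `fake_quantize`, the per-tensor scale, and the example weights are all hypothetical, and the paper's actual quantizer (e.g. per-channel scales or power-of-two scaling for FPGA efficiency) may differ.

```python
import numpy as np

def fake_quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor fake quantization (illustrative sketch).

    Rounds values onto a signed `bits`-bit integer grid and maps them
    back to float. In QAT, the forward pass uses these snapped values
    while gradients bypass the rounding (straight-through estimator).
    """
    qmax = 2 ** (bits - 1) - 1           # 7 for 4-bit, 31 for 6-bit
    scale = np.max(np.abs(x)) / qmax     # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # e.g. [-8, 7] at 4-bit
    return q * scale                     # dequantized float values

# Hypothetical weight tensor, snapped to 4-bit and 6-bit grids
w = np.array([0.9, -0.35, 0.02, -0.7])
w4 = fake_quantize(w, bits=4)   # coarse grid: 16 levels
w6 = fake_quantize(w, bits=6)   # finer grid: 64 levels, lower error
```

The sketch also illustrates why the paper's accuracy result is plausible: going from 6-bit to 4-bit only coarsens the rounding grid, so the loss penalty can stay small while the narrower integers shrink the FPGA datapath.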