Integer-only Quantized Transformers for Embedded FPGA-based Time-series Forecasting in AIoT

📅 2024-07-06
🏛️ 2024 IEEE Annual Congress on Artificial Intelligence of Things (AIoT)
📈 Citations: 6
✨ Influential: 0
🤖 AI Summary
To address time-series forecasting demands on resource-constrained AIoT edge devices, this work proposes a hardware-accelerated integer-quantized Transformer tailored for embedded FPGAs. Methodologically, it integrates quantization-aware training (QAT) with 4-/6-bit integer quantization and a customized low-bit hardware architecture; notably, it presents the first FPGA implementation of a 4-bit integer-quantized Transformer on a Spartan-7 device. Through joint optimization of resource utilization, power consumption, and timing, it addresses the trade-off among accuracy, energy efficiency, and latency at ultra-low bitwidths. Experiments show that the 4-bit model increases test loss by only 0.63% versus the baseline, delivers up to 132.33× higher inference throughput than an 8-bit counterpart, and reduces energy consumption by 48.19×, while also revealing a non-monotonic relationship between quantization bitwidth and energy/latency. This work validates the feasibility of deploying Transformers on ultra-low-power IoT endpoints and establishes a reproducible software-hardware co-design paradigm for edge time-series intelligence.
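The low-bit integer quantization the summary describes can be illustrated with a minimal symmetric per-tensor scheme. This is an illustrative sketch, not the paper's implementation: the function names and the per-tensor max-abs scale choice are assumptions for demonstration.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int):
    """Map float values to signed integers in [-(2^(b-1)), 2^(b-1)-1]
    using a symmetric per-tensor scale (illustrative choice)."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.max(np.abs(x))) / qmax   # per-tensor scale factor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from integers and their scale."""
    return q * scale

# Compare reconstruction error at the bitwidths the paper targets
x = np.array([0.5, -1.2, 0.03, 0.9])
for bits in (8, 6, 4):
    q, s = quantize_symmetric(x, bits)
    err = float(np.max(np.abs(dequantize(q, s) - x)))
    print(bits, q.tolist(), round(err, 4))
```

Lower bitwidths shrink the integer range (4-bit signed covers only -8..7), which is why the precision gap versus the floating-point baseline is the key metric the paper reports.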

๐Ÿ“ Abstract
This paper presents the design of a hardware accelerator for Transformers, optimized for on-device time-series forecasting in AIoT systems. It integrates integer-only quantization and Quantization-Aware Training with optimized hardware designs to realize 6-bit and 4-bit quantized Transformer models, which achieved precision comparable to 8-bit quantized models from related research. Utilizing a complete implementation on an embedded FPGA (Xilinx Spartan-7 XC7S15), we examine the feasibility of deploying Transformer models on embedded IoT devices. This includes a thorough analysis of achievable precision, resource utilization, timing, power, and energy consumption for on-device inference. Our results indicate that while sufficient performance can be attained, the optimization process is not trivial. For instance, reducing the quantization bitwidth does not consistently result in decreased latency or energy consumption, underscoring the necessity of systematically exploring various optimization combinations. Compared to an 8-bit quantized Transformer model in related studies, our 4-bit quantized Transformer model increases test loss by only 0.63%, operates up to 132.33ร— faster, and consumes 48.19ร—less energy.
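Quantization-Aware Training, as mentioned in the abstract, inserts quantize-dequantize ("fake quantization") operations into the forward pass so the model learns to tolerate low-bit rounding; gradients typically flow through unchanged via the straight-through estimator. The sketch below shows only the forward-pass effect under assumed 4-bit symmetric quantization; it is not the authors' training pipeline.

```python
import numpy as np

def fake_quant(x: np.ndarray, bits: int) -> np.ndarray:
    """Quantize-dequantize in the forward pass. During QAT the backward
    pass treats this op as identity (straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.max(np.abs(x))) / qmax
    return np.round(np.clip(x / scale, -qmax - 1, qmax)) * scale

# Simulated linear layer with 4-bit weights and activations
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3))
a = rng.standard_normal((2, 4))
y_fp = a @ w                                  # float reference
y_q = fake_quant(a, 4) @ fake_quant(w, 4)     # what the QAT model "sees"
print(float(np.max(np.abs(y_fp - y_q))))      # rounding-induced error
```

Because training already experiences this rounding error, the deployed integer model closely matches QAT-time behavior, which is how the paper keeps the 4-bit test-loss gap to 0.63%.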
Problem

Research questions and friction points this paper is trying to address.

Designing efficient hardware accelerators for embedded AIoT time-series forecasting
Implementing low-bit quantization (4-6 bit) while maintaining model precision
Optimizing resource utilization and energy consumption for FPGA deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integer-only quantization for Transformer models
Quantization-aware training with optimized hardware design
Embedded FPGA deployment for time-series forecasting
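"Integer-only" inference means the deployed accelerator never touches floating point: accumulators are rescaled between layers with an integer multiply and a bit shift, a standard trick for fixed-point hardware. The requantization sketch below is a generic illustration of that technique, with hypothetical multiplier/shift values, not the paper's RTL design.

```python
import numpy as np

def requantize(acc: np.ndarray, multiplier: int, shift: int, bits: int) -> np.ndarray:
    """Rescale int32 accumulators to low-bit outputs using only an
    integer multiply and an arithmetic right shift (no floats),
    emulating fixed-point rescaling logic on an FPGA."""
    qmax = 2 ** (bits - 1) - 1
    scaled = (acc.astype(np.int64) * multiplier) >> shift
    return np.clip(scaled, -qmax - 1, qmax).astype(np.int8)

# 4-bit activations/weights held in int8 containers (hypothetical values)
a = np.array([[3, -7, 2]], dtype=np.int8)
w = np.array([[1], [4], [-2]], dtype=np.int8)
acc = a.astype(np.int32) @ w.astype(np.int32)  # widened int accumulation
out = requantize(acc, multiplier=1024, shift=12, bits=4)  # scale = 1024/4096
print(acc.tolist(), out.tolist())
```

On an FPGA, this replaces a DSP-hungry floating-point rescale with a multiplier and shifter, which is one reason ultra-low bitwidths can pay off in area and energy even when latency does not fall monotonically.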