🤖 AI Summary
To address the challenge of end-to-end Transformer training on memory- and compute-constrained edge devices (e.g., smartphones, cameras), this work presents the first FPGA-based, ultra-low-memory, fully on-chip Transformer training framework. Methodologically, it introduces: (1) a novel bidirectional tensor-contraction streaming algorithm that drastically compresses gradients and activations; (2) a fully on-chip training architecture integrating low-rank tensor compression, customized compute kernels, intra-layer parallelism, and pipelined scheduling; and (3) BRAM/URAM-coordinated on-chip storage of all parameters and gradients. On an AMD Alveo U50 FPGA, it performs single-batch end-to-end training of models of 36.7–93.5 MB using less than 6 MB of BRAM plus 22.5 MB of URAM (a 30–51× memory reduction versus uncompressed GPU training) and cuts per-epoch energy consumption to as little as 28% of that of an NVIDIA RTX 3090 GPU. This is the first demonstration of complete, end-to-end Transformer training on an FPGA.
📝 Abstract
Transformer models have achieved state-of-the-art performance across a wide range of machine learning tasks. There is growing interest in training transformers on resource-constrained edge devices due to considerations such as privacy, domain adaptation, and on-device scientific machine learning. However, the significant computational and memory demands required for transformer training often exceed the capabilities of an edge device. Leveraging low-rank tensor compression, this paper presents the first on-FPGA accelerator for end-to-end transformer training. On the algorithm side, we present a bi-directional contraction flow for tensorized transformer training, significantly reducing the computational FLOPS and intra-layer memory costs compared to existing tensor operations. On the hardware side, we store all highly compressed model parameters and gradient information on chip, creating an on-chip-memory-only framework for each stage in training. This reduces off-chip communication and minimizes latency and energy costs. Additionally, we implement custom computing kernels for each training stage and employ intra-layer parallelism and pipe-lining to further enhance run-time and memory efficiency. Through experiments on transformer models within $36.7$ to $93.5$ MB using FP-32 data formats on the ATIS dataset, our tensorized FPGA accelerator could conduct single-batch end-to-end training on the AMD Alevo U50 FPGA, with a memory budget of less than $6$-MB BRAM and $22.5$-MB URAM. Compared to uncompressed training on the NVIDIA RTX 3090 GPU, our on-FPGA training achieves a memory reduction of $30 imes$ to $51 imes$. Our FPGA accelerator also achieves up to $3.6 imes$ less energy cost per epoch compared with tensor Transformer training on an NVIDIA RTX 3090 GPU.