FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs

📅 2024-09-21
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of dedicated hardware accelerators for Multi-Head Attention (MHA) in Transformers, this paper proposes a configurable MHA acceleration architecture for Xilinx UltraScale+ FPGAs. The design is flexible in the number of attention heads, embedding dimension, and tile size, and combines efficient tiling of large matrices, pipelined processing element (PE) arrays, and tightly coupled BRAM storage to raise matrix-level parallelism and on-chip memory utilization. Implemented at the RTL level and deployed on the Alveo U55C, the accelerator achieves 328 GOPS throughput with an 8-head, 768-dimension, 64-tile configuration, and is 3.28×, 2.6×, and 1.3× faster than CPU (Intel Xeon Gold 5220R), GPU (NVIDIA V100), and the fastest state-of-the-art FPGA baselines, respectively.

📝 Abstract
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV). Their popularity is largely attributed to the exceptional performance of their multi-head self-attention blocks when analyzing sequential data and extracting features. To date, there are limited hardware accelerators tailored for this mechanism, which is the first step before designing an accelerator for a complete model. This paper proposes *FAMOUS*, a flexible hardware accelerator for dense multi-head attention (MHA) computation of TNNs on field-programmable gate arrays (FPGAs). It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency. An efficient tiling of large matrices has been employed to distribute memory and computing resources across different modules on various FPGA platforms. The design is evaluated on Xilinx Alveo U55C and U200 data center cards containing UltraScale+ FPGAs. Experimental results show that it can attain a maximum throughput, number of parallel attention heads, embedding dimension and tile size of 328 giga operations/second (GOPS), 8, 768 and 64 respectively on the U55C. Furthermore, it is 3.28× and 2.6× faster than the Intel Xeon Gold 5220R CPU and NVIDIA V100 GPU respectively. It is also 1.3× faster than the fastest state-of-the-art FPGA-based accelerator.
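To make the accelerated workload concrete, below is a minimal NumPy sketch of the dense MHA computation the abstract describes, using the paper's U55C configuration (8 heads, embedding dimension 768, tile size 64). The sequence length, weight initialization, and the placement of tiling only in the projection matmuls are illustrative assumptions, not the accelerator's actual dataflow.

```python
import numpy as np

H, D, T, TILE = 8, 768, 128, 64   # heads, embedding dim, sequence length (assumed), tile size
d_h = D // H                      # per-head dimension

def tiled_matmul(A, B, tile=TILE):
    """Blocked matrix multiply: accumulate tile-by-tile, mimicking how
    sub-blocks would be streamed through a PE array from on-chip BRAM."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal((T, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) * D**-0.5 for _ in range(3))

# Q/K/V projections via the tiled matmul, then split into H heads
Q = tiled_matmul(x, Wq).reshape(T, H, d_h)
K = tiled_matmul(x, Wk).reshape(T, H, d_h)
V = tiled_matmul(x, Wv).reshape(T, H, d_h)

# Per-head attention: softmax(Q K^T / sqrt(d_h)) V, heads concatenated
heads = []
for h in range(H):
    scores = softmax(Q[:, h] @ K[:, h].T / np.sqrt(d_h))
    heads.append(scores @ V[:, h])
out = np.concatenate(heads, axis=1)   # shape (T, D)
```

On the FPGA, each tile-level multiply-accumulate in `tiled_matmul` maps naturally to a pipelined PE array pass, which is why tile size is one of the accelerator's configurable parameters alongside head count and embedding dimension.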
Problem

Research questions and friction points this paper is trying to address.

Develop FPGA accelerator for Transformer attention mechanism
Optimize processing elements and memory utilization
Achieve higher throughput than CPU and GPU
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flexible FPGA accelerator for attention mechanism
Optimized processing elements and on-chip memory
Efficient matrix tiling for resource distribution