Pushing up to the Limit of Memory Bandwidth and Capacity Utilization for Efficient LLM Decoding on Embedded FPGA

📅 2025-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of inefficient large language model (LLM) decoding on resource-constrained embedded FPGA edge devices—particularly due to limited memory bandwidth and capacity—this work presents the first end-to-end deployment of a 7B-parameter LLaMA2 model on a bare-metal Zynq KV260 platform (4 GB DDR4) without an operating system. We propose a custom operator-fusion dataflow architecture tailored for LLM decoding, coupled with a high-transaction-efficiency data layout, DDR4 bandwidth-aware scheduling, lightweight hardware accelerators, and a fully pipelined execution engine. Experimental results demonstrate a real-time decoding throughput of approximately 5 tokens/second, 93.3% DRAM capacity utilization, and DDR4 bandwidth utilization reaching 85% of its theoretical peak—marking a significant breakthrough in overcoming resource bottlenecks for LLM inference on embedded FPGAs.

📝 Abstract
The extremely high computational and storage demands of large language models have excluded most edge devices, which were widely used for efficient machine learning, from being viable options. A typical edge device usually has only 4GB of memory capacity and a bandwidth of less than 20GB/s, while a large language model quantized to 4-bit precision with 7B parameters already requires 3.5GB of capacity, and its decoding process is purely bandwidth-bound. In this paper, we aim to explore these limits by proposing a hardware accelerator for large language model (LLM) inference on the Zynq-based KV260 platform, equipped with 4GB of 64-bit 2400Mbps DDR4 memory. We successfully deploy a LLaMA2-7B model, achieving a decoding speed of around 5 tokens/s, utilizing 93.3% of the memory capacity and reaching 85% of the decoding speed implied by the theoretical memory bandwidth limit. To fully reserve the memory capacity for model weights and key-value cache, we develop the system in a bare-metal environment without an operating system. To fully reserve the bandwidth for model weight transfers, we implement a customized dataflow with an operator-fusion pipeline and propose a data arrangement format that maximizes data transaction efficiency. This research marks the first attempt to deploy a 7B-level LLM on a standalone embedded field programmable gate array (FPGA) device. It provides key insights into efficient LLM inference on embedded FPGA devices and offers guidelines for future architecture design.
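The abstract's numbers can be cross-checked with a back-of-envelope roofline estimate: a 64-bit DDR4-2400 interface peaks at 19.2 GB/s, and bandwidth-bound decoding must stream the entire 3.5 GB of 4-bit weights once per token. The sketch below uses only figures stated in the abstract; the formula itself is a standard bandwidth-ceiling estimate, not code from the paper.

```python
# Back-of-envelope decode-throughput ceiling for a bandwidth-bound LLM.
# All constants come from the abstract (KV260: 64-bit DDR4-2400, 7B
# model at 4-bit precision = 3.5 GB of weights).

BUS_WIDTH_BITS = 64      # DDR4 bus width on the KV260
DATA_RATE_MTS = 2400     # DDR4-2400 transfer rate, in MT/s
WEIGHTS_GB = 3.5         # 7B parameters at 4-bit precision

# Peak DDR4 bandwidth in GB/s: bytes per transfer * transfers per second.
peak_bw_gbs = (BUS_WIDTH_BITS / 8) * DATA_RATE_MTS / 1000   # 19.2 GB/s

# Decoding reads every weight once per token, so the bandwidth ceiling
# on throughput is simply bandwidth divided by the weight footprint.
ceiling_tokens_s = peak_bw_gbs / WEIGHTS_GB                 # ~5.5 tokens/s

# At the reported 85% bandwidth utilization:
achieved_tokens_s = 0.85 * ceiling_tokens_s                 # ~4.7 tokens/s

print(f"peak bandwidth:       {peak_bw_gbs:.1f} GB/s")
print(f"bandwidth ceiling:    {ceiling_tokens_s:.2f} tokens/s")
print(f"at 85% utilization:   {achieved_tokens_s:.2f} tokens/s")
```

The ~4.7 tokens/s result is consistent with the "around 5 tokens/s" the abstract reports, which is what "purely bandwidth-bound" decoding predicts.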
Problem

Research questions and friction points this paper is trying to address.

Optimize LLM decoding on edge devices
Maximize memory bandwidth and capacity
Deploy 7B LLM on embedded FPGA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardware accelerator for LLM inference
Bare-metal system optimization
Customized dataflow and operator fusion
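The operator-fusion idea behind the customized dataflow can be illustrated in software: instead of dequantizing the full 4-bit weight matrix into a separate full-precision buffer and then multiplying (two passes over the weights), each code is dequantized and consumed immediately. The sketch below is a hypothetical plain-Python analogue of that pattern, not the paper's actual hardware kernels; the function names and the 4-bit encoding are illustrative assumptions.

```python
# Hypothetical sketch (plain Python, not the paper's RTL): fusing 4-bit
# weight dequantization into the matrix-vector product so the
# full-precision weight matrix is never materialized as its own buffer.

def dequant(code, scale):
    """Map an unsigned 4-bit code in [0, 15] to a float centered at zero."""
    return (code - 8) * scale

def matvec_unfused(w4, scale, x):
    # Two passes: materialize the whole dequantized matrix, then multiply.
    w = [[dequant(c, scale) for c in row] for row in w4]
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def matvec_fused(w4, scale, x):
    # One pass: each code is dequantized and consumed on the fly, the
    # software analogue of an on-chip operator-fusion pipeline that
    # avoids spilling intermediates to DRAM.
    return [sum(dequant(c, scale) * xj for c, xj in zip(row, x)) for row in w4]

w4 = [[3, 15, 0, 8], [8, 1, 12, 7]]   # toy 4-bit weight codes
x = [1.0, -2.0, 0.5, 4.0]
assert matvec_unfused(w4, 0.1, x) == matvec_fused(w4, 0.1, x)
print("fused and unfused results match:", matvec_fused(w4, 0.1, x))
```

In hardware the payoff is memory traffic, not arithmetic: the fused form reads each weight from DRAM exactly once, which is what lets the design approach the bandwidth ceiling.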
Jindong Li
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences; Center for Long-term Artificial Intelligence; School of Artificial Intelligence, University of Chinese Academy of Sciences
Tenglong Li
Institute of Automation, Chinese Academy of Sciences
Hardware Architecture
Guobin Shen
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences; Center for Long-term Artificial Intelligence; School of Future Technology; School of Artificial Intelligence, University of Chinese Academy of Sciences
Dongcheng Zhao
Beijing Institute of AI Safety and Governance
Spiking Neural Networks · Event-Based Vision · Brain-inspired AI · LLM Safety
Qian Zhang
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences; Center for Long-term Artificial Intelligence; School of Artificial Intelligence, University of Chinese Academy of Sciences
Yi Zeng
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences; Center for Long-term Artificial Intelligence; School of Future Technology; School of Artificial Intelligence, University of Chinese Academy of Sciences; Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences