Pushing up to the Limit of Memory Bandwidth and Capacity Utilization for Efficient LLM Decoding on Embedded FPGA

📅 2025-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of inefficient large language model (LLM) decoding on resource-constrained embedded FPGA edge devices—particularly due to limited memory bandwidth and capacity—this work presents the first end-to-end deployment of a 7B-parameter LLaMA2 model on a bare-metal Zynq KV260 platform (4 GB DDR4) without an operating system. We propose a custom operator-fusion dataflow architecture tailored for LLM decoding, coupled with a high-transaction-efficiency data layout, DDR4 bandwidth-aware scheduling, lightweight hardware accelerators, and a fully pipelined execution engine. Experimental results demonstrate a real-time decoding throughput of approximately 5 tokens/second, 93.3% DRAM capacity utilization, and DDR4 bandwidth utilization reaching 85% of its theoretical peak—marking a significant breakthrough in overcoming resource bottlenecks for LLM inference on embedded FPGAs.

📝 Abstract
The extremely high computational and storage demands of large language models have excluded most edge devices, which were widely used for efficient machine learning, from being viable options. A typical edge device usually has only 4GB of memory capacity and a bandwidth of less than 20GB/s, while a large language model quantized to 4-bit precision with 7B parameters already requires 3.5GB of capacity, and its decoding process is purely bandwidth-bound. In this paper, we aim to explore these limits by proposing a hardware accelerator for large language model (LLM) inference on the Zynq-based KV260 platform, equipped with 4GB of 64-bit 2400Mbps DDR4 memory. We successfully deploy a LLaMA2-7B model, achieving a decoding speed of around 5 tokens/s, utilizing 93.3% of the memory capacity and reaching 85% of the decoding speed implied by the theoretical memory bandwidth limit. To fully reserve the memory capacity for model weights and key-value cache, we develop the system in a bare-metal environment without an operating system. To fully reserve the bandwidth for model weight transfers, we implement a customized dataflow with an operator-fusion pipeline and propose a data arrangement format that maximizes data transaction efficiency. This research marks the first attempt to deploy a 7B-level LLM on a standalone embedded field programmable gate array (FPGA) device. It provides key insights into efficient LLM inference on embedded FPGA devices and offers guidelines for future architecture design.
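The abstract's numbers can be cross-checked with a back-of-envelope roofline estimate: a 64-bit DDR4-2400 interface peaks at 19.2 GB/s, and bandwidth-bound decoding must stream the entire 3.5 GB of 4-bit weights once per token. The sketch below uses only figures stated in the abstract; the formula itself is a standard bandwidth-ceiling estimate, not code from the paper.

```python
# Back-of-envelope decode-throughput ceiling for a bandwidth-bound LLM.
# All constants come from the abstract (KV260: 64-bit DDR4-2400, 7B
# model at 4-bit precision = 3.5 GB of weights).

BUS_WIDTH_BITS = 64      # DDR4 bus width on the KV260
DATA_RATE_MTS = 2400     # DDR4-2400 transfer rate, in MT/s
WEIGHTS_GB = 3.5         # 7B parameters at 4-bit precision

# Peak DDR4 bandwidth in GB/s: bytes per transfer * transfers per second.
peak_bw_gbs = (BUS_WIDTH_BITS / 8) * DATA_RATE_MTS / 1000   # 19.2 GB/s

# Decoding reads every weight once per token, so the bandwidth ceiling
# on throughput is simply bandwidth divided by the weight footprint.
ceiling_tokens_s = peak_bw_gbs / WEIGHTS_GB                 # ~5.5 tokens/s

# At the reported 85% bandwidth utilization:
achieved_tokens_s = 0.85 * ceiling_tokens_s                 # ~4.7 tokens/s

print(f"peak bandwidth:       {peak_bw_gbs:.1f} GB/s")
print(f"bandwidth ceiling:    {ceiling_tokens_s:.2f} tokens/s")
print(f"at 85% utilization:   {achieved_tokens_s:.2f} tokens/s")
```

The ~4.7 tokens/s result is consistent with the "around 5 tokens/s" the abstract reports, which is what "purely bandwidth-bound" decoding predicts.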
Problem

Research questions and friction points this paper is trying to address.

Optimize LLM decoding on edge devices
Maximize memory bandwidth and capacity
Deploy 7B LLM on embedded FPGA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardware accelerator for LLM inference
Bare-metal system optimization
Customized dataflow and operator fusion
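The operator-fusion idea behind the customized dataflow can be illustrated in software: instead of dequantizing the full 4-bit weight matrix into a separate full-precision buffer and then multiplying (two passes over the weights), each code is dequantized and consumed immediately. The sketch below is a hypothetical plain-Python analogue of that pattern, not the paper's actual hardware kernels; the function names and the 4-bit encoding are illustrative assumptions.

```python
# Hypothetical sketch (plain Python, not the paper's RTL): fusing 4-bit
# weight dequantization into the matrix-vector product so the
# full-precision weight matrix is never materialized as its own buffer.

def dequant(code, scale):
    """Map an unsigned 4-bit code in [0, 15] to a float centered at zero."""
    return (code - 8) * scale

def matvec_unfused(w4, scale, x):
    # Two passes: materialize the whole dequantized matrix, then multiply.
    w = [[dequant(c, scale) for c in row] for row in w4]
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def matvec_fused(w4, scale, x):
    # One pass: each code is dequantized and consumed on the fly, the
    # software analogue of an on-chip operator-fusion pipeline that
    # avoids spilling intermediates to DRAM.
    return [sum(dequant(c, scale) * xj for c, xj in zip(row, x)) for row in w4]

w4 = [[3, 15, 0, 8], [8, 1, 12, 7]]   # toy 4-bit weight codes
x = [1.0, -2.0, 0.5, 4.0]
assert matvec_unfused(w4, 0.1, x) == matvec_fused(w4, 0.1, x)
print("fused and unfused results match:", matvec_fused(w4, 0.1, x))
```

In hardware the payoff is memory traffic, not arithmetic: the fused form reads each weight from DRAM exactly once, which is what lets the design approach the bandwidth ceiling.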
Jindong Li
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences; Center for Long-term Artificial Intelligence; School of Artificial Intelligence, University of Chinese Academy of Sciences
Tenglong Li
Institute of Automation, Chinese Academy of Sciences
Hardware Architecture
Guobin Shen
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences; Center for Long-term Artificial Intelligence; School of Future Technology; School of Artificial Intelligence, University of Chinese Academy of Sciences
Dongcheng Zhao
Beijing Institute of AI Safety and Governance
Spiking Neural Networks · Event-Based Vision · Brain-inspired AI · LLM Safety
Qian Zhang
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences; Center for Long-term Artificial Intelligence; School of Artificial Intelligence, University of Chinese Academy of Sciences
Yi Zeng
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences; Center for Long-term Artificial Intelligence; School of Future Technology; School of Artificial Intelligence, University of Chinese Academy of Sciences; Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences