🤖 AI Summary
To address the challenge of deploying large language models (LLMs) on resource-constrained edge devices (e.g., robots), this work proposes a CPU–FPGA heterogeneous acceleration system built on the AMD Xilinx VCU128 platform. It introduces a novel mixed-precision group-level systolic array: FP16×FP16 computation for multi-head attention (MHA) layers and FP16×INT4 co-processing for feed-forward network (FFN) layers. Additionally, it proposes a log-scale structured weight quantization scheme and a unified data-parallel compilation framework that eliminates inter-operator data reordering. Experimental results show that, compared to an NVIDIA A100 GPU, the system achieves 1.91× higher throughput and 7.55× better energy efficiency. Against the state-of-the-art FPGA-based FlightLLM, it improves HBM bandwidth utilization, energy efficiency, and LLM throughput by 10–24%. This work establishes a scalable hardware–algorithm co-optimization paradigm for low-power, real-time LLM inference at the edge.
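The summary mentions log-scale structured weight quantization. The paper's exact encoding is not given here, but the core idea of log-scale (power-of-two) quantization can be sketched as follows: each nonzero weight is snapped to a signed power of two, so that in hardware a multiply reduces to an exponent add (a bit shift). The 4-bit layout (1 sign bit, 3-bit exponent) and the exponent range below are illustrative assumptions, not the paper's scheme.

```python
import math

def quantize_log4(w):
    """Snap a weight to a 4-bit log-scale value: sign * 2**e.

    Hypothetical encoding for illustration: 1 sign bit plus a 3-bit
    exponent clamped to [-6, 1] (8 levels); zero maps to zero.
    """
    if w == 0.0:
        return 0.0
    sign = 1.0 if w > 0 else -1.0
    e = round(math.log2(abs(w)))      # nearest power-of-two exponent
    e = max(-6, min(1, e))            # clamp to the assumed 3-bit range
    return sign * (2.0 ** e)

weights = [0.37, -0.052, 1.4, 0.0]
print([quantize_log4(w) for w in weights])  # → [0.5, -0.0625, 1.0, 0.0]
```

Because every quantized weight is a power of two, a weight-activation product in the datapath becomes a shift of the activation's mantissa/exponent rather than a full multiplier, which is what makes log-scale schemes attractive on FPGAs.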
📝 Abstract
The rapid advancement of artificial intelligence (AI), particularly large language models (LLMs), has profoundly changed how we work and communicate. However, deploying LLMs on resource-constrained edge devices (such as robots) remains challenging due to intensive computation requirements, heavy memory access, diverse operator types, and compilation difficulties. In this work, we propose EdgeLLM to address these issues. First, targeting computation, we design a mixed-precision processing-element array with a group systolic architecture that efficiently supports both FP16×FP16 for the multi-head attention (MHA) block and FP16×INT4 for the feed-forward network (FFN) layer. In addition, a log-scale structured weight-sparsity optimization further increases efficiency. Second, to address compilation and deployment, we analyze all operators in LLM models and develop a universal data-parallelism scheme in which every input and output feature keeps the same data shape, allowing different operators to be processed without any data rearrangement. We then propose an end-to-end compiler that maps the entire LLM model onto a CPU–FPGA heterogeneous system (AMD Xilinx VCU128 FPGA). The accelerator achieves 1.91× higher throughput and 7.55× higher energy efficiency than a commercial GPU (NVIDIA A100-SXM4-80G). Compared with the state-of-the-art FPGA accelerator FlightLLM, it performs 10–24% better in terms of HBM bandwidth utilization, energy efficiency, and LLM throughput.
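The abstract's FP16×INT4 FFN path implies dequantizing INT4 weights before (or during) the multiply-accumulate. A common way to do this, consistent with the "group-level" array the summary describes, is group-wise quantization: one FP16 scale per fixed-size group of INT4 weights. The group size, weight layout, and function names below are assumptions for illustration, not the paper's interface.

```python
def dequant_group_int4(qweights, scales, group_size):
    """Dequantize signed INT4 weights (values in [-8, 7]) using one
    floating-point scale per contiguous group of `group_size` weights.
    Hypothetical group-wise layout for illustration."""
    return [q * scales[i // group_size] for i, q in enumerate(qweights)]

def dot_fp16_int4(x, qw, scales, group_size):
    """Mixed-precision dot product: FP16-like activations `x` against
    a column of group-quantized INT4 weights `qw`."""
    w = dequant_group_int4(qw, scales, group_size)
    return sum(a * b for a, b in zip(x, w))

# Tiny example: 4 weights in 2 groups of 2, with per-group scales.
x = [1.0, 2.0, 3.0, 4.0]
qw = [2, -1, 4, 3]            # INT4 codes
scales = [0.5, 0.25]          # one FP16 scale per group
print(dot_fp16_int4(x, qw, scales, group_size=2))  # → 6.0
```

In hardware the dequantization would be fused into the systolic array's processing elements rather than materialized as a separate weight tensor; the sketch only shows the arithmetic being performed.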