Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective

📅 2024-10-06
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Hardware adaptation bottlenecks hinder efficient LLM deployment at the edge. Method: We systematically survey and uniformly benchmark state-of-the-art inference acceleration techniques—including pruning, quantization, KV cache optimization, and operator fusion—across CPU, GPU, FPGA, ASIC, and processing-in-memory (PIM) platforms, using identical models and methods under batch sizes of 1 and 8. Our evaluation quantifies throughput (tokens/s) and energy efficiency (tokens/J). Contribution/Results: ASIC and PIM architectures achieve substantially higher energy efficiency than conventional platforms. Based on empirical findings, we identify three key evolutionary directions for edge AI: native multimodal support, dynamic runtime computation scheduling, and orders-of-magnitude improvement in inference capability per unit energy. This work provides empirically grounded guidance for hardware selection and hardware-software co-optimization of LLMs in resource-constrained edge environments.
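
To make the evaluation concrete, the following minimal Python sketch shows how the two reported metrics, throughput (tokens/s) and energy efficiency (tokens/J), relate to measured wall time and average power draw. The numeric values are hypothetical placeholders, not measurements from the paper.

```python
# Minimal sketch of the survey's two comparison metrics.
# All numeric values below are hypothetical placeholders.

def throughput_tokens_per_s(generated_tokens: int, wall_time_s: float) -> float:
    """Absolute inference speed: tokens generated per second of wall time."""
    return generated_tokens / wall_time_s

def energy_efficiency_tokens_per_j(generated_tokens: int,
                                   avg_power_w: float,
                                   wall_time_s: float) -> float:
    """Energy efficiency: tokens per joule, where energy (J) = power (W) * time (s)."""
    return generated_tokens / (avg_power_w * wall_time_s)

# Example: a hypothetical accelerator generating 512 tokens in 4 s at 30 W.
tokens, t, p = 512, 4.0, 30.0
print(f"throughput: {throughput_tokens_per_s(tokens, t):.1f} tokens/s")            # 128.0
print(f"efficiency: {energy_efficiency_tokens_per_j(tokens, p, t):.2f} tokens/J")  # 4.27
```

Since tokens/J is just tokens/s divided by average watts, low-power ASIC and PIM designs can lead on energy efficiency even when GPUs lead on raw speed, which is consistent with the comparison above.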

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation. Compared to non-generative LLMs like BERT and DeBERTa, generative LLMs like the GPT series and Llama series are currently the main focus due to their superior algorithmic performance. The advancements in generative LLMs are closely intertwined with the development of hardware capabilities. Various hardware platforms exhibit distinct characteristics that can be exploited to improve LLM inference performance. Therefore, this paper comprehensively surveys efficient generative LLM inference on different hardware platforms. First, we provide an overview of the algorithm architecture of mainstream generative LLMs and delve into the inference process. Then, we summarize optimization methods for different platforms such as CPU, GPU, FPGA, ASIC, and PIM/NDP, and provide inference results for generative LLMs. Furthermore, we perform a qualitative and quantitative comparison of inference performance with batch sizes 1 and 8 on different hardware platforms, considering hardware power consumption, absolute inference speed (tokens/s), and energy efficiency (tokens/J). We compare the performance of the same optimization methods across different hardware platforms, the performance across different hardware platforms, and the performance of different methods on the same hardware platform. This provides a systematic and comprehensive summary of existing inference acceleration work by integrating software optimization methods and hardware platforms. We point out that three trends (multimodality, inference-time compute, and higher inference energy efficiency) promise to redefine the capabilities of edge artificial intelligence systems. Our project is available at https://dai.sjtu.edu.cn/project.html.
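
Among the optimization techniques surveyed, KV cache optimization targets the autoregressive decode loop at the heart of generative LLM inference. As a minimal single-head illustration (with arbitrary random weights, not tied to any model or platform in the survey), each decode step appends one key/value row to a cache and attends over it, rather than recomputing keys and values for the whole sequence:

```python
import numpy as np

# Toy single-head attention decode loop illustrating the KV cache:
# each step computes and appends only the new token's key/value row.
# Dimensions and weights are arbitrary placeholders.

d = 64                                   # head dimension (placeholder)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []                # grow by one row per decoded token

def decode_step(x_new: np.ndarray) -> np.ndarray:
    """Attend the newest token's query over all cached keys/values."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)           # append instead of recomputing K
    v_cache.append(x_new @ Wv)           # append instead of recomputing V
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = (K @ q) / np.sqrt(d)        # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over past positions
    return weights @ V                   # attention output for the new token

for _ in range(8):                       # decode 8 tokens autoregressively
    out = decode_step(rng.standard_normal(d))
print("cached steps:", len(k_cache), "output shape:", out.shape)
```

The cache trades memory capacity and bandwidth for compute, which is one reason it is a natural target for the memory-centric PIM/NDP platforms compared in the survey.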
Problem

Research questions and friction points this paper is trying to address.

Large Language Model Optimization
Hardware Adaptation
Energy Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Model Optimization
Hardware-Aware Efficiency
Energy Efficiency in AI
Jinhao Li
Shanghai Jiao Tong University, China
Jiaming Xu
Shanghai Jiao Tong University & Infinigence-AI, China
Shan Huang
Shanghai Jiao Tong University, China
Yonghua Chen
Infinigence-AI, China
Wen Li
Infinigence-AI, China
Jun Liu
Shanghai Jiao Tong University, China
Yaoxiu Lian
Shanghai Jiao Tong University, China
Jiayi Pan
Shanghai Jiao Tong University, China
Li Ding
Shanghai Jiao Tong University, China
Hao Zhou
Shanghai Jiao Tong University, China
Yu Wang
Tsinghua University, China
Guohao Dai
Associate Professor, Shanghai Jiao Tong University, China
Sparse Computation, Large-scale Graph Processing, FPGA, Circuits and Systems