A Systematic Characterization of LLM Inference on GPUs

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing work lacks a systematic understanding of large language model (LLM) inference behavior on GPUs. Method: This paper introduces the first four-dimensional analytical framework—spanning two-phase computational heterogeneity, microarchitectural-level performance root causes, system-scale scaling laws, and boundaries of emerging inference paradigms—validated via large-scale empirical measurement, deep GPU microarchitectural analysis, fine-grained performance modeling, and cross-architecture scalability evaluation across mainstream GPUs (A100/H100) and LLMs (7B–70B). Contribution/Results: The study uncovers latent bottlenecks between attention computation and memory access, identifies previously unrecognized critical constraints, and establishes hardware-aware theoretical performance bounds and deployable optimization strategies. It fills a fundamental gap in system-level LLM inference analysis and enables the design of efficient, scalable next-generation inference systems.

📝 Abstract
This work presents a systematic characterization of Large Language Model (LLM) inference to address fragmented understanding. Through comprehensive experiments, we establish a four-dimensional analytical framework: (1) Two-Phase Heterogeneity Observation; (2) Microarchitectural Root Cause Analysis; (3) System Scaling Principles; and (4) Emerging Paradigm Boundaries. Our investigation progresses systematically from observation to foresight: identifying performance phenomena, revealing hardware causes, validating system behavior, and exploring new paradigms. This study not only consolidates a reliable empirical foundation for existing research but also provides new discoveries and practical optimization guidance for LLM inference.
Problem

Research questions and friction points this paper is trying to address.

LLM inference performance on GPUs lacks a systematic, end-to-end characterization
No unified analytical framework connects phase-level behavior to microarchitectural causes
Existing studies leave a fragmented empirical foundation and little concrete optimization guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-phase heterogeneity observation for performance analysis
Microarchitectural root cause analysis of hardware bottlenecks
System scaling principles and emerging paradigm exploration
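The "two-phase heterogeneity" the framework starts from is the well-known split between prefill (whole prompt processed at once, large matrix-matrix products) and decode (one token per step, matrix-vector products). A minimal roofline-style sketch, using assumed nominal A100 figures (~312 TFLOPS fp16 tensor-core throughput, ~2.0 TB/s HBM bandwidth) rather than numbers from this paper, shows why the two phases land on opposite sides of the compute/memory boundary:

```python
# Illustrative roofline sketch (assumed hardware figures, not from the paper):
# arithmetic intensity of applying one (d, d) fp16 weight matrix to s tokens.

def arithmetic_intensity(d: int, s: int) -> float:
    """FLOPs per byte for an (s, d) x (d, d) matmul in fp16 (2 bytes/elem)."""
    flops = 2.0 * s * d * d                        # multiply-adds
    bytes_moved = 2.0 * (s * d + d * d + s * d)    # activations + weights + output
    return flops / bytes_moved

# Assumed A100 peaks: ~312 TFLOPS fp16, ~2.0 TB/s HBM -> ridge ~156 FLOP/byte.
RIDGE = 312e12 / 2.0e12

d = 4096
prefill = arithmetic_intensity(d, s=2048)  # whole prompt in one pass
decode = arithmetic_intensity(d, s=1)      # one token per decode step

# Prefill lands well above the ridge (compute-bound); decode sits near
# 1 FLOP/byte, far below it (memory-bound).
print(f"prefill: ~{prefill:.0f} FLOP/B -> "
      f"{'compute' if prefill > RIDGE else 'memory'}-bound")
print(f"decode:  ~{decode:.2f} FLOP/B -> "
      f"{'compute' if decode > RIDGE else 'memory'}-bound")
```

The intensity simplifies to s·d / (2s + d), so for a single decode token it collapses to roughly 1 FLOP per byte regardless of model width, which is the root of the phase heterogeneity the paper's first dimension examines.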
Haonan Wang
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, China
Xuxin Xiao
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, China
Mingyu Yan
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, China
Zhuoyuan Zhu
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, China
Dengke Han
Institute of Computing Technology, Chinese Academy of Sciences
graph-based hardware accelerator · high-throughput computer architecture
Duo Wang
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, China
Wenming Li
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, China
Xiaochun Ye
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, China
Cunchen Hu
China Telecom Cloud Computing Research Institute, China
Hongyang Chen
Sun Yat-sen University
SDN · Cloud Computing · Microservice · AIOps
Guangyu Sun
School of Integrated Circuits, Peking University
Computer Architecture · Design Automation · Emerging Memory