🤖 AI Summary
Existing work lacks a systematic understanding of large language model (LLM) inference behavior on GPUs. Method: This paper introduces the first four-dimensional analytical framework, spanning two-phase computational heterogeneity, microarchitectural performance root causes, system-scale scaling laws, and the boundaries of emerging inference paradigms. The framework is validated through large-scale empirical measurement, deep GPU microarchitectural analysis, fine-grained performance modeling, and cross-architecture scalability evaluation on mainstream GPUs (A100/H100) and LLMs (7B–70B). Contribution/Results: The study uncovers latent bottlenecks in the interplay between attention computation and memory access, identifies previously unrecognized critical constraints, and establishes hardware-aware theoretical performance bounds together with deployable optimization strategies. It fills a fundamental gap in system-level LLM inference analysis and informs the design of efficient, scalable next-generation inference systems.
📝 Abstract
This work presents a systematic characterization of Large Language Model (LLM) inference, addressing the field's currently fragmented understanding of its performance behavior. Through comprehensive experiments, we establish a four-dimensional analytical framework: (1) Two-Phase Heterogeneity Observation; (2) Microarchitectural Root Cause Analysis; (3) System Scaling Principles; and (4) Emerging Paradigm Boundaries. Our investigation progresses from observation to foresight: identifying performance phenomena, revealing their hardware root causes, validating system-scale behavior, and probing the limits of new inference paradigms. The study both consolidates a reliable empirical foundation for existing research and contributes new findings and practical optimization guidance for LLM inference.
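The two-phase heterogeneity named in the framework (compute-bound prefill vs. memory-bound decode) can be sketched with a back-of-the-envelope roofline argument. The sketch below is illustrative and not from the paper: it assumes a dense fp16 model of roughly 7B parameters whose weights are read once per forward pass, and uses the publicly documented A100 peak FP16 throughput (~312 TFLOPS) and HBM bandwidth (~2 TB/s) to place the ridge point.

```python
# Illustrative roofline sketch (assumptions, not the paper's measurements):
# decode processes 1 token per weight read, prefill processes a whole prompt,
# so their arithmetic intensities fall on opposite sides of the GPU ridge point.

def arithmetic_intensity(tokens: int, params: float) -> float:
    """FLOPs per byte for one forward pass over `tokens` tokens of a dense
    fp16 model with `params` parameters, assuming weights are read once."""
    flops = 2.0 * params * tokens   # ~2 FLOPs per parameter per token
    bytes_moved = 2.0 * params      # fp16 weights: 2 bytes each, read once
    return flops / bytes_moved

# A100 ridge point: peak FP16 FLOP/s divided by HBM bandwidth (public specs)
A100_RIDGE = 312e12 / 2.0e12  # ~156 FLOPs/byte

decode = arithmetic_intensity(tokens=1, params=7e9)      # 1 FLOP/byte
prefill = arithmetic_intensity(tokens=2048, params=7e9)  # 2048 FLOPs/byte

# decode sits far below the ridge (bandwidth-limited),
# prefill sits far above it (compute-limited)
print(decode < A100_RIDGE < prefill)  # → True
```

Under these simplifying assumptions (KV-cache traffic and activations ignored), decode lands well below the ridge point and prefill well above it, which is the usual intuition behind phase-aware scheduling and disaggregated prefill/decode serving.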