A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving

📅 2025-04-17
🤖 AI Summary
Existing GPU low-precision computing frameworks are constrained by power-of-two bit-width requirements and by high-level abstractions that block register-level and memory-access optimizations, limiting both programming flexibility and hardware performance. This paper introduces the first general-purpose GPU virtual machine supporting arbitrary-bit-width (non-power-of-two) low-precision data types. The approach breaks through these abstraction-layer limitations via a thread-block-level domain-specific language (DSL), hierarchical memory spaces, an algebraic layout system, and fine-grained register management. Combined with automatic vectorization and custom instruction selection at compile time, it generates kernels that match hand-optimized performance. Evaluated across a full spectrum of low-precision types, the framework consistently outperforms Triton, Ladder, QuantLLM, and Marlin (up to a 2.61× speedup) and significantly improves LLM inference throughput and energy efficiency.

📝 Abstract
Serving Large Language Models (LLMs) is critical for AI-powered applications but demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance due to high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, which are essential for efficient low-precision computations. In this paper, we introduce a virtual machine (VM) designed for General-Purpose GPU (GPGPU) computing, enabling support for low-precision data types with arbitrary bit widths while maintaining GPU programmability. The proposed VM features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. VM programs are compiled into highly efficient GPU programs with automatic vectorization and instruction selection. Extensive experiments demonstrate that our VM efficiently supports a full spectrum of low-precision data types, and outperforms state-of-the-art low-precision kernels on their supported types. Compared to existing compilers like Triton and Ladder, as well as hand-optimized kernels such as QuantLLM and Marlin, our VM achieves performance improvements of 1.75x, 2.61x, 1.29x and 1.03x, respectively.
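The abstract's central point is support for non-power-of-two bit widths, which existing kernels cannot express because such values do not align to byte or word boundaries. A minimal sketch (not the paper's actual API; names and packing strategy are illustrative assumptions) shows why this is awkward: a 3-bit weight stream must be packed densely across 32-bit word boundaries and unpacked with straddle handling.

```python
# Illustrative sketch, NOT the paper's implementation: densely packing
# 3-bit (non-power-of-two) integer weights into 32-bit words, the kind
# of layout an arbitrary-bit-width VM must generate code for.

def pack_3bit(values):
    """Pack a list of 3-bit integers (0..7) densely into 32-bit words."""
    acc, bits, words = 0, 0, []
    for v in values:
        assert 0 <= v < 8, "value exceeds 3 bits"
        acc |= v << bits
        bits += 3
        if bits >= 32:
            words.append(acc & 0xFFFFFFFF)  # emit a full word
            acc >>= 32                      # keep the spilled high bits
            bits -= 32
    if bits > 0:
        words.append(acc & 0xFFFFFFFF)      # final partial word
    return words

def unpack_3bit(words, count):
    """Recover `count` 3-bit integers from packed 32-bit words."""
    out = []
    for i in range(count):
        bit = i * 3
        word, off = divmod(bit, 32)
        v = words[word] >> off
        if off > 29:                        # value straddles a word boundary
            v |= words[word + 1] << (32 - off)
        out.append(v & 0b111)
    return out
```

The boundary-straddling branch in `unpack_3bit` is exactly the kind of fine-grained bit manipulation that, per the abstract, high-level GPU abstractions make hard to optimize.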
Problem

Research questions and friction points this paper is trying to address.

Enables low-precision GPGPU computation with arbitrary bit widths
Overcomes limitations of existing low-precision kernel generation approaches
Improves performance and efficiency for LLM serving workloads
Innovation

Methods, ideas, or system contributions that make the work stand out.

VM for arbitrary low-precision GPGPU computation
Thread-block-level programming model and memory hierarchy
Automatic vectorization and instruction selection for GPU
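One innovation listed above is an algebraic layout system. The paper does not spell out its construction here, but the general idea of such systems can be sketched as layouts being functions from logical coordinates to linear offsets that compose algebraically, so a tiled layout is the composition of a tile-grid layout with an intra-tile layout. The code below is a hypothetical illustration of that idea, not the paper's design.

```python
# Hypothetical sketch of an "algebraic layout" system: a layout maps a
# logical (row, col) coordinate to a linear memory offset, and layouts
# compose. This mirrors the general concept, not the paper's actual
# implementation.

def row_major(rows, cols):
    """Plain row-major layout over a rows x cols array."""
    return lambda r, c: r * cols + c

def tiled(rows, cols, tr, tc):
    """Compose a grid-of-tiles layout with a row-major intra-tile layout."""
    grid = row_major(rows // tr, cols // tc)   # which tile holds (r, c)
    intra = row_major(tr, tc)                  # offset within that tile
    tile_size = tr * tc
    return lambda r, c: (grid(r // tr, c // tc) * tile_size
                         + intra(r % tr, c % tc))
```

For example, `tiled(4, 4, 2, 2)` stores a 4x4 matrix as four contiguous 2x2 tiles; because each layout is an explicit algebraic object, a compiler can reason about it to plan coalesced memory accesses and register assignments.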
Yaoyao Ding
University of Toronto
Computer Systems, Machine Learning, Compilers
Bohan Hou
PhD in Computer Science, Carnegie Mellon University
Machine Learning, Systems
Xiao Zhang
University of Toronto, Toronto, Canada
Allan Lin
University of Waterloo, Waterloo, Canada
Tianqi Chen
Carnegie Mellon University, Pittsburgh, USA
Cody Hao Yu
Anyscale, Santa Clara, USA
Yida Wang
Amazon, Santa Clara, USA
Gennady Pekhimenko
University of Toronto
Computer Architecture, Systems, Systems for ML, Machine Learning