🤖 AI Summary
Existing GPU low-precision computing frameworks are constrained by power-of-two bit-width requirements and high-level abstractions that hinder register-level and memory-access optimizations, limiting both programming flexibility and hardware performance. This paper introduces the first general-purpose GPU virtual machine supporting arbitrary-bit-width (non-power-of-two) low-precision data types. The approach overcomes these abstraction-layer limitations via a thread-block-level domain-specific language (DSL), hierarchical memory spaces, an algebraic layout system, and fine-grained register management. Coupled with automatic vectorization and custom instruction selection at compile time, it generates kernels that match hand-optimized performance. Evaluated across a full spectrum of low-precision types, the framework consistently outperforms Triton, Ladder, QuantLLM, and Marlin, achieving up to 2.61x speedup, and significantly improves LLM inference throughput and energy efficiency.
📝 Abstract
Serving Large Language Models (LLMs) is critical for AI-powered applications but demands substantial computational resources, particularly memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance due to high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, which are essential for efficient low-precision computation. In this paper, we introduce a virtual machine (VM) designed for General-Purpose GPU (GPGPU) computing, enabling support for low-precision data types with arbitrary bit widths while maintaining GPU programmability. The proposed VM features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. VM programs are compiled into highly efficient GPU programs with automatic vectorization and instruction selection. Extensive experiments demonstrate that our VM efficiently supports a full spectrum of low-precision data types and outperforms state-of-the-art low-precision kernels on the types they support. Compared to the compilers Triton and Ladder, as well as the hand-optimized kernels QuantLLM and Marlin, our VM achieves speedups of 1.75x, 2.61x, 1.29x, and 1.03x, respectively.
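To make the core challenge concrete, the sketch below illustrates what "arbitrary bit width" means at the storage level: values of a non-power-of-two width (here 3-bit unsigned integers) packed contiguously into 32-bit words, so that individual fields straddle word boundaries. This is a minimal, hypothetical illustration of the data layout problem such kernels must decode efficiently, not the paper's actual implementation or API.

```python
def pack_bits(values, bit_width=3):
    """Pack small unsigned ints of any bit width into 32-bit words, LSB-first.

    Fields are laid out back-to-back, so a value may straddle a word
    boundary -- exactly the case that power-of-two-only schemes avoid.
    """
    word, used, words = 0, 0, []
    for v in values:
        assert 0 <= v < (1 << bit_width), "value out of range for bit width"
        word |= v << used            # append the next field above the used bits
        used += bit_width
        if used >= 32:               # word full: emit low 32 bits, keep overflow
            words.append(word & 0xFFFFFFFF)
            word >>= 32
            used -= 32
    if used:                         # flush any remaining partial word
        words.append(word & 0xFFFFFFFF)
    return words


def unpack_bits(words, count, bit_width=3):
    """Inverse of pack_bits: recover `count` fields from the packed words."""
    bitstream = 0
    for i, w in enumerate(words):
        bitstream |= w << (32 * i)   # reassemble the contiguous bit stream
    mask = (1 << bit_width) - 1
    return [(bitstream >> (i * bit_width)) & mask for i in range(count)]
```

For example, 40 three-bit weights occupy 120 bits, i.e. four 32-bit words instead of the 40 bytes an 8-bit layout would need; a GPU kernel consuming this format must perform the shift-and-mask extraction above (ideally vectorized and register-resident), which is the kind of fine-grained work the abstract argues high-level abstractions obstruct.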