🤖 AI Summary
Existing GPU low-precision computing frameworks are constrained by power-of-two bit-width requirements and high-level abstractions that hinder register-level and memory-access optimizations, limiting both programming flexibility and hardware performance. This paper introduces the first general-purpose GPU virtual machine supporting arbitrary-bit-width (non-power-of-two) low-precision data types. The approach overcomes these abstraction-layer limitations via a thread-block-level domain-specific language (DSL), hierarchical memory spaces, an algebraic layout system, and fine-grained register management. Coupled with automatic vectorization and custom instruction selection at compile time, it generates kernels that match hand-optimized performance. Evaluated across a full spectrum of low-precision types, the framework consistently outperforms Triton, Ladder, QuantLLM, and Marlin, achieving up to 2.61x speedup, and significantly improves LLM inference throughput and energy efficiency.
📝 Abstract
Serving Large Language Models (LLMs) is critical for AI-powered applications but demands substantial computational resources, particularly memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance due to high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, which are essential for efficient low-precision computation. In this paper, we introduce a virtual machine (VM) designed for General-Purpose GPU (GPGPU) computing, enabling support for low-precision data types with arbitrary bit widths while maintaining GPU programmability. The proposed VM features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. VM programs are compiled into highly efficient GPU programs with automatic vectorization and instruction selection. Extensive experiments demonstrate that our VM efficiently supports a full spectrum of low-precision data types and outperforms state-of-the-art low-precision kernels on the types they support. Compared to the compilers Triton and Ladder, as well as the hand-optimized kernels QuantLLM and Marlin, our VM achieves speedups of 1.75x, 2.61x, 1.29x, and 1.03x, respectively.
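To make the core challenge concrete, the sketch below illustrates what "arbitrary bit width" means at the storage level: values of a non-power-of-two width (here 3-bit unsigned integers) packed contiguously into 32-bit words, so that individual fields straddle word boundaries. This is a minimal, hypothetical illustration of the data layout problem such kernels must decode efficiently, not the paper's actual implementation or API.

```python
def pack_bits(values, bit_width=3):
    """Pack small unsigned ints of any bit width into 32-bit words, LSB-first.

    Fields are laid out back-to-back, so a value may straddle a word
    boundary -- exactly the case that power-of-two-only schemes avoid.
    """
    word, used, words = 0, 0, []
    for v in values:
        assert 0 <= v < (1 << bit_width), "value out of range for bit width"
        word |= v << used            # append the next field above the used bits
        used += bit_width
        if used >= 32:               # word full: emit low 32 bits, keep overflow
            words.append(word & 0xFFFFFFFF)
            word >>= 32
            used -= 32
    if used:                         # flush any remaining partial word
        words.append(word & 0xFFFFFFFF)
    return words


def unpack_bits(words, count, bit_width=3):
    """Inverse of pack_bits: recover `count` fields from the packed words."""
    bitstream = 0
    for i, w in enumerate(words):
        bitstream |= w << (32 * i)   # reassemble the contiguous bit stream
    mask = (1 << bit_width) - 1
    return [(bitstream >> (i * bit_width)) & mask for i in range(count)]
```

For example, 40 three-bit weights occupy 120 bits, i.e. four 32-bit words instead of the 40 bytes an 8-bit layout would need; a GPU kernel consuming this format must perform the shift-and-mask extraction above (ideally vectorized and register-resident), which is the kind of fine-grained work the abstract argues high-level abstractions obstruct.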