APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing GPU-accelerated inference for ultra-low-bit quantized large language models (LLMs) is hindered by Tensor Core limitations in narrow-bit support, low memory efficiency, and inflexible kernel designs. To address these challenges, this work proposes an end-to-end acceleration framework: (1) a bipolar-INT integer format enabling lossless signed-integer representation at ultra-low precision; (2) a novel bit-level matrix decomposition and recomposition multiplication scheme supporting arbitrary bit-widths (≤8-bit); and (3) a dynamic kernel mapping and shared-memory-optimized memory management system. Evaluated on RTX 3090/4090 and H800 GPUs, the framework achieves up to 3.99× speedup over FP16 baselines and 2.16× over CUTLASS INT4. It significantly overcomes hardware bottlenecks in supporting arbitrary-precision quantization, enabling efficient, flexible, and high-throughput LLM inference across diverse low-bit configurations.
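The summary does not spell out how bipolar-INT stays lossless. One plausible encoding (an illustrative assumption, not necessarily the paper's exact definition) gives every bit a weight of ±1, so a w-bit bipolar value is v = Σ(2bᵢ − 1)·2ⁱ, and a w-bit signed integer x maps bijectively to v = 2x + 1 via a simple offset-binary code:

```python
def signed_to_bipolar_bits(x: int, w: int) -> list[int]:
    """Encode a w-bit signed integer as w bipolar bits (each weighing +/-1).
    Assumed encoding: bipolar value v = sum((2*b_i - 1) * 2**i) = 2*x + 1,
    so the stored bits b_i are just the offset-binary code of x."""
    t = x + (1 << (w - 1))          # shift into unsigned range [0, 2^w)
    return [(t >> i) & 1 for i in range(w)]

def bipolar_bits_to_signed(bits: list[int]) -> int:
    """Decode bipolar bits back to the original signed integer."""
    v = sum((2 * b - 1) * (1 << i) for i, b in enumerate(bits))  # bipolar value
    return (v - 1) // 2             # invert v = 2x + 1

# The round trip is lossless over the full signed range.
w = 4
for x in range(-(1 << (w - 1)), 1 << (w - 1)):
    assert bipolar_bits_to_signed(signed_to_bipolar_bits(x, w)) == x
```

Because every bit carries a nonzero weight, each bit plane contributes symmetrically to the product, which is what makes the format amenable to the uniform per-bit parallel computation the summary describes.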

📝 Abstract
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization can reduce computational costs; however, attaining the extreme efficiency of ultra-low-bit quantized LLMs at arbitrary precision is challenging on GPUs, primarily due to limited Tensor Core support, inefficient memory management, and inflexible kernel optimizations. To tackle these challenges, we propose a comprehensive acceleration scheme for arbitrary-precision LLMs, namely APT-LLM. First, we introduce a novel data format, bipolar-INT, which allows efficient and lossless conversion with signed INT while being more conducive to parallel computation. We also develop a matrix multiplication (MatMul) method that supports arbitrary precision by dismantling and reassembling matrices at the bit level; this provides flexible precision and optimizes the utilization of GPU Tensor Cores. In addition, we propose a memory management system focused on data recovery, which strategically employs fast shared memory to substantially increase kernel execution speed and reduce memory access latency. Finally, we develop a kernel mapping method that dynamically selects the optimal configurable kernel hyperparameters for varying matrix sizes, enabling optimal performance across different LLM architectures and precision settings. In LLM inference, APT-LLM achieves up to a 3.99× speedup over FP16 baselines and a 2.16× speedup over NVIDIA CUTLASS INT4 acceleration on RTX 3090. On RTX 4090 and H800, APT-LLM achieves up to a 2.44× speedup over FP16 and a 1.65× speedup over CUTLASS integer baselines.
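The abstract's bit-level dismantle-and-reassemble MatMul can be sketched in a few lines. In this minimal NumPy model (the real kernels run the per-bit products as 1-bit Tensor Core instructions, which NumPy only emulates, and signed handling is omitted for brevity), each operand is split into bit planes, every pair of planes is multiplied as a 1-bit matrix product, and the partial products are reassembled with power-of-two weights:

```python
import numpy as np

def bitplane_matmul(A: np.ndarray, B: np.ndarray, wa: int, wb: int) -> np.ndarray:
    """Arbitrary-precision MatMul by bit-plane decomposition.
    A holds wa-bit and B holds wb-bit unsigned ints; each pair of bit
    planes is multiplied as a 1-bit matrix product and rescaled by 2^(i+j)."""
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(wa):
        Ai = (A >> i) & 1                      # i-th bit plane of A
        for j in range(wb):
            Bj = (B >> j) & 1                  # j-th bit plane of B
            acc += (Ai.astype(np.int64) @ Bj.astype(np.int64)) << (i + j)
    return acc

rng = np.random.default_rng(0)
A = rng.integers(0, 16, size=(8, 32))          # e.g. 4-bit activations
B = rng.integers(0, 4, size=(32, 8))           # e.g. 2-bit weights
assert np.array_equal(bitplane_matmul(A, B, 4, 2), A @ B)
```

Because the decomposition treats each bit plane uniformly, the same kernel structure covers any operand widths ≤ 8 bits, which is the flexibility the paper attributes to its scheme.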
Problem

Research questions and friction points this paper is trying to address.

Accelerating LLM inference with arbitrary precision quantization
Overcoming GPU Tensor Core limitations for low-bit computation
Optimizing memory management and kernel performance on GPUs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel bipolar-INT data format for parallel computation
Bit-level matrix multiplication method for arbitrary precision
Dynamic kernel mapping with optimal hyperparameter selection
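The dynamic kernel mapping point amounts to choosing per-GEMM-shape kernel hyperparameters. As a hedged illustration (the candidate tile sizes and the padding-waste heuristic below are invented for this sketch; the paper's actual search space and selection criteria are not given here), one simple policy ranks candidate tilings by how little padding they waste, preferring larger tiles on ties:

```python
import math
from itertools import product

# Hypothetical tile-size candidates; not the paper's actual configuration space.
TILE_M = (32, 64, 128)
TILE_N = (32, 64, 128)

def pick_tiling(M: int, N: int) -> tuple[int, int]:
    """Pick the (tile_m, tile_n) pair wasting the fewest padded elements,
    breaking ties toward larger tiles (fewer thread blocks). A stand-in
    for a dynamic kernel-mapping heuristic, not the paper's method."""
    def waste(tm: int, tn: int) -> int:
        padded = math.ceil(M / tm) * tm * math.ceil(N / tn) * tn
        return padded - M * N
    return min(product(TILE_M, TILE_N),
               key=lambda t: (waste(*t), -t[0] * t[1]))

# A 64x64 problem fits a 64x64 tile exactly, so that tiling wins.
assert pick_tiling(64, 64) == (64, 64)
```

In a real system the lookup would typically be driven by offline profiling rather than a closed-form score, but the shape-dependent selection step is the same.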
Shaobo Ma
School of Electronic Science and Engineering, Nanjing University
Chao Fang
Shanghai Qi Zhi Institute
efficient ML, AI accelerator, hardware-software co-design, precision-scalable computing, RISC-V
Haikuo Shao
School of Electronic Science and Engineering, Nanjing University
Zhongfeng Wang
Nanjing University
VLSI, FEC, DSP, MIMO, Neural Network