Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational and memory costs of attention mechanisms in large language model inference, which stem from their quadratic complexity and reliance on high-precision arithmetic. The authors propose a low-bit mixed-precision attention kernel based on the microscaling floating-point (MXFP) format, uniquely integrating diagonal blocking with MXFP to fuse two types of low-bit operations at the block level. A finely optimized kernel is implemented using Triton to fully exploit the hardware parallelism and memory characteristics of next-generation GPUs such as the NVIDIA B200. Experimental results demonstrate that the proposed method achieves substantial improvements in inference efficiency with negligible degradation in generation quality. The implementation has been open-sourced to facilitate reproducibility and further research.
📝 Abstract
Transformer-based large language models (LLMs) have demonstrated remarkable performance across a wide range of real-world tasks, but their inference cost remains prohibitively high due to the quadratic complexity of attention and the memory bandwidth limitations of high-precision operations. In this work, we present a low-bit mixed-precision attention kernel using the microscaling floating-point (MXFP) data format, leveraging the low-bit computing capabilities of next-generation GPU architectures. Our Diagonal-Tiled Mixed-Precision Attention (DMA) incorporates two kinds of low-bit computation at the tile level within a single carefully fused kernel implemented in Triton, exploiting hardware-level parallelism and memory efficiency to enable fast inference without compromising model performance. Extensive empirical evaluations on NVIDIA B200 GPUs show that our kernel maintains generation quality with negligible degradation while achieving significant speedups through kernel fusion. We release our code at https://github.com/yifu-ding/MP-Sparse-Attn.
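To make the MXFP format referenced above concrete: microscaling formats group elements into small blocks that share a single power-of-two scale, with each element stored in a narrow floating-point type. The sketch below fake-quantizes an array with MXFP4-style block scaling (block size 32 and the E2M1 element grid follow the OCP Microscaling convention); it is an illustration of the data format only, not the paper's kernel.

```python
import numpy as np

# Representable magnitudes of an FP4 (E2M1) element, per the OCP
# Microscaling (MX) convention -- used here as an illustrative element type.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # MX block size: 32 elements share one power-of-two scale

def mxfp4_quantize_dequantize(x):
    """Fake-quantize a 1-D array with MXFP4-style block scaling.

    Each block of 32 values shares one power-of-two (E8M0-like) scale;
    elements are rounded to the nearest FP4 grid point. Returns the
    dequantized array, which is convenient for measuring quantization error.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for start in range(0, len(x), BLOCK):
        blk = x[start:start + BLOCK]
        amax = np.max(np.abs(blk))
        if amax == 0.0:
            out[start:start + BLOCK] = 0.0
            continue
        # Power-of-two scale chosen so the block max fits within the grid.
        scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1]))
        scaled = np.abs(blk) / scale
        # Round each magnitude to the nearest representable FP4 value.
        idx = np.argmin(np.abs(scaled[:, None] - FP4_GRID[None, :]), axis=1)
        out[start:start + BLOCK] = np.sign(blk) * FP4_GRID[idx] * scale
    return out
```

Because the shared scale is a power of two, dequantization is a cheap exponent shift on hardware, which is what makes block-scaled low-bit matmuls attractive on GPUs with native MXFP support.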
Problem

Research questions and friction points this paper is trying to address.

low-bit inference
attention mechanism
memory bandwidth
computational cost
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

mixed-precision attention
MXFP
low-bit inference
kernel fusion
Triton
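The diagonal-tiled idea listed above can be sketched as blocked attention in which score tiles on the diagonal are computed in higher precision while off-diagonal tiles take a lower-precision path. The precision split below (float32 vs. emulated float16) and the tile size are illustrative assumptions, not the paper's exact MXFP scheme or its fused Triton kernel.

```python
import numpy as np

TILE = 4  # illustrative tile size; real kernels use much larger tiles

def tiled_mixed_precision_attention(q, k, v):
    """Blocked attention with a per-tile precision split (illustrative).

    Diagonal score tiles are computed in float32; off-diagonal tiles
    emulate a low-bit path by casting Q/K to float16 first. This mimics
    mixing two precisions at the tile level, as the summary describes.
    """
    n, d = q.shape
    scores = np.empty((n, n), dtype=np.float32)
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            qi, kj = q[i:i + TILE], k[j:j + TILE]
            if i == j:
                # Higher-precision path for diagonal tiles.
                tile = qi.astype(np.float32) @ kj.astype(np.float32).T
            else:
                # Emulated low-precision path for off-diagonal tiles.
                tile = (qi.astype(np.float16)
                        @ kj.astype(np.float16).T).astype(np.float32)
            scores[i:i + TILE, j:j + TILE] = tile
    scores /= np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v
```

The intuition for this split is that, under causal masking, tiles near the diagonal tend to carry most of the attention mass, so spending precision there preserves quality while the remaining tiles tolerate cheaper low-bit arithmetic.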