🤖 AI Summary
To address memory bandwidth bottlenecks and kernel launch overhead—dominant efficiency barriers in single-batch large language model (LLM) inference for edge deployment and ultra-low-latency scenarios—this work proposes a full-model-level fused kernel design, breaking from conventional operator-level optimization paradigms. Our approach integrates CUDA whole-model kernels, cross-operator memory access coordination, and quantization-aware compilation. Evaluated under INT4/FP16 quantization across diverse LLM scales, it achieves up to 2.3× end-to-end speedup over state-of-the-art inference kernels, while significantly reducing first-token latency. This represents the first systematic effort to maximize end-to-end hardware utilization for low-batch Transformer inference, establishing a new, efficient, and scalable hardware-software co-optimization pathway tailored for resource-constrained, latency-critical environments.
📝 Abstract
The size and compute characteristics of modern large language models have led to increased interest in developing specialized kernels tailored for training and inference. Existing kernels primarily optimize for compute utilization, targeting large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications of interest, such as edge deployment and latency-sensitive settings. This paper describes FlashFormer, a proof-of-concept kernel for accelerating single-batch inference for transformer-based large language models. Across various model sizes and quantization settings, we observe nontrivial speedups compared to existing state-of-the-art inference kernels.
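The core idea behind whole-model kernel fusion, running several operators in one kernel so intermediate activations never round-trip through global memory and only one launch is paid, can be illustrated with a deliberately simplified sketch. The actual FlashFormer implementation is a CUDA kernel; the Python below is only a stand-in model, where each list traversal plays the role of one kernel's global-memory pass:

```python
# Simplified model of operator fusion. This is NOT the paper's CUDA code;
# it only illustrates why fusing operators reduces memory traffic and
# launch count in the low-batch regime.

def unfused(x, scale, bias):
    """Two 'kernels': each one reads and writes the full activation."""
    passes = 0
    y = [v * scale for v in x]   # kernel 1: elementwise scale
    passes += 1
    y = [v + bias for v in y]    # kernel 2: elementwise bias add
    passes += 1
    return y, passes

def fused(x, scale, bias):
    """One 'kernel': the intermediate value stays 'in registers'."""
    return [v * scale + bias for v in x], 1

x = [1.0, 2.0, 3.0]
a, pa = unfused(x, 2.0, 0.5)
b, pb = fused(x, 2.0, 0.5)
assert a == b    # identical results
assert pb < pa   # fused version makes fewer passes over the data
```

In the memory-bandwidth-bound single-batch setting the abstract describes, each avoided pass over the activations translates directly into saved latency, which is why fusing across operators (up to the whole model) pays off even when it adds no arithmetic savings.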