FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address memory bandwidth bottlenecks and kernel launch overhead—dominant efficiency barriers in single-batch large language model (LLM) inference for edge deployment and ultra-low-latency scenarios—this work proposes a full-model-level fused kernel design, breaking from conventional operator-level optimization paradigms. Our approach integrates CUDA whole-model kernels, cross-operator memory access coordination, and quantization-aware compilation. Evaluated under INT4/FP16 quantization across diverse LLM scales, it achieves up to 2.3× end-to-end speedup over state-of-the-art inference kernels, while significantly reducing first-token latency. This represents the first systematic effort to maximize end-to-end hardware utilization for low-batch Transformer inference, establishing a new, efficient, and scalable hardware-software co-optimization pathway tailored for resource-constrained, latency-critical environments.
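Why single-batch decoding is memory-bandwidth bound can be seen with a back-of-envelope estimate: each generated token must stream essentially all model weights from device memory once. The sketch below uses illustrative numbers (a hypothetical 7B-parameter model, INT4 weights, 1 TB/s memory bandwidth), not figures from the paper.

```python
# Back-of-envelope memory-bandwidth ceiling for single-batch decoding.
# Assumption: every weight is read from device memory once per token,
# so latency is bounded below by (weight bytes) / (memory bandwidth).

def decode_ceiling(n_params: float, bits_per_weight: int,
                   bandwidth_gbps: float) -> tuple[float, float]:
    """Return (min latency per token in ms, max tokens/s)."""
    weight_bytes = n_params * bits_per_weight / 8
    latency_s = weight_bytes / (bandwidth_gbps * 1e9)
    return latency_s * 1e3, 1.0 / latency_s

# Hypothetical: 7B parameters, INT4 quantization, 1 TB/s bandwidth.
ms_per_token, tokens_per_s = decode_ceiling(7e9, 4, 1000.0)
print(f"{ms_per_token:.2f} ms/token, {tokens_per_s:.0f} tokens/s ceiling")
```

Under these assumptions the ceiling is about 3.5 ms per token (~286 tokens/s) regardless of compute throughput, which is why reducing redundant memory traffic, rather than raising compute utilization, is the lever for this regime.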

📝 Abstract
The size and compute characteristics of modern large language models have led to an increased interest in developing specialized kernels tailored for training and inference. Existing kernels primarily optimize for compute utilization, targeting the large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications of interest such as edge deployment and latency-sensitive applications. This paper describes FlashFormer, a proof-of-concept kernel for accelerating single-batch inference for transformer-based large language models. Across various model sizes and quantization settings, we observe nontrivial speedups compared to existing state-of-the-art inference kernels.
Problem

Research questions and friction points this paper is trying to address.

Optimizing low-batch inference for transformers
Reducing memory-bandwidth bottlenecks and kernel launch overhead
Accelerating edge and latency-sensitive applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Whole-model fused CUDA kernel for low-batch inference
Cross-operator memory coordination cuts bandwidth and launch overhead
Up to 2.3× end-to-end speedup for single-batch transformer decoding
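The launch-overhead side of the contribution can also be sketched numerically: a conventional per-operator pipeline pays one launch per kernel, while a whole-model kernel pays one launch per token. The per-launch cost (~5 µs) and operator counts below are assumptions for illustration, not measurements from the paper.

```python
# Illustrative comparison of per-operator kernel launches vs. one
# fused whole-model kernel. Overhead and op counts are assumed values.

def launch_overhead_ms(n_layers: int, kernels_per_layer: int,
                       overhead_us: float) -> float:
    """Total kernel-launch overhead per generated token, in ms."""
    return n_layers * kernels_per_layer * overhead_us / 1e3

per_op = launch_overhead_ms(32, 10, 5.0)  # hypothetical 32-layer model, ~10 kernels/layer
fused = launch_overhead_ms(1, 1, 5.0)     # whole-model kernel: a single launch
print(f"per-op: {per_op:.2f} ms/token, fused: {fused:.3f} ms/token")
```

With these assumed numbers, per-operator launches alone cost ~1.6 ms per token, on the same order as the bandwidth-bound compute itself, while a single fused launch makes the overhead negligible.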