Deep Kernel Fusion for Transformers

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance bottleneck in long-context large language model inference caused by the SwiGLU MLP module, whose large weights induce frequent high-bandwidth memory (HBM) accesses under memory-bandwidth-constrained conditions. To mitigate this issue, the authors propose DeepFusionKernel, a novel approach that integrates operator fusion, optimized cache reuse, and reduced HBM traffic to substantially lower memory access overhead. The method features adaptable fused kernels compatible with diverse models, inference configurations, and hardware platforms, coupled with a scheduling mechanism that delivers consistent acceleration independent of generation length. Experimental results demonstrate up to 13.2% and 9.7% inference speedup on H100 and A100 GPUs, respectively, significantly outperforming the SGLang baseline.
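To make the bottleneck concrete, a reference (unfused) SwiGLU MLP can be sketched as below. This is an illustrative sketch, not the paper's DeepFusionKernel: each matrix multiply streams a large weight matrix from HBM and materializes an intermediate activation, which is exactly the traffic that a deeply fused kernel aims to cut by keeping tiles resident in cache. All names and dimensions here are hypothetical.

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Unfused SwiGLU MLP as used in Llama-style transformer blocks.

    Three separate matmuls each read a large weight matrix from HBM and
    write an intermediate back to memory; a fused kernel would compute
    the gate, up-projection, elementwise product, and down-projection
    per tile without round-tripping intermediates through HBM.
    """
    gate = silu(x @ w_gate)      # (tokens, d_ff) intermediate
    up = x @ w_up                # (tokens, d_ff) intermediate
    return (gate * up) @ w_down  # (tokens, d_model) output

# Toy sizes for illustration (real d_ff is far larger than cache).
rng = np.random.default_rng(0)
d_model, d_ff, tokens = 64, 256, 4
x = rng.standard_normal((tokens, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
y = swiglu_mlp(x, w_gate, w_up, w_down)
assert y.shape == (tokens, d_model)
```

Because decode-phase batches are small, these matmuls are memory-bound: the weight reads dominate, which is why reducing HBM traffic rather than FLOPs yields the reported speedups.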

📝 Abstract
Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a major yet under-optimized bottleneck. We propose DeepFusionKernel, a deeply fused kernel that cuts HBM traffic and boosts cache reuse, delivering up to 13.2% speedup on H100 and 9.7% on A100 over SGLang. Integrated with SGLang and paired with a kernel scheduler, DeepFusionKernel maintains consistent acceleration across generation lengths, while remaining adaptable to diverse models, inference configurations, and hardware platforms.
Problem

Research questions and friction points this paper is trying to address.

memory bandwidth bottleneck
SwiGLU MLP
long-context LLM inference
cache capacity
HBM traffic
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Kernel Fusion
SwiGLU Optimization
Memory Bandwidth Efficiency
Cache Reuse
Transformer Inference Acceleration