ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding

📅 2026-04-26
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work targets the latency of large language model decoding caused by fragmented operator execution and repeated off-chip materialization of intermediate tensors. It extends cluster-level fusion from the attention-side operators to the entire Transformer decoder block, fusing LayerNorm, QKV projection, RoPE, decode attention, output projection, MLP, and the residual connections. The implementation relies on CUDA thread-block clusters, on-chip inter-block collectives, persistent TMA descriptors, and a CUDA-Graph-compatible execution mode. On an NVIDIA RTX 5090-class GPU, the method improves throughput by 1.34× for Pythia-2.8B and yields similar gains for Pythia-6.9B, while keeping generation near-token-identical.
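To make the fusion scope concrete, the following is a minimal, unfused reference sketch of the operator sequence named above for a single decode step, written in plain Python. It is not the paper's kernel. The function names, the parameter dictionary keys (`wq`, `wk`, `wv`, `wo`, `w_up`, `w_down`), and the cache layout are hypothetical. The sketch also assumes a parallel-residual block (attention and MLP both read the block input), applies RoPE over the full head dimension, and omits biases and learned LayerNorm parameters. Each helper call corresponds to a step that an unfused pipeline would typically run as a separate kernel, with its result written out before the next step reads it.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize to zero mean and unit variance (learned scale/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def matvec(w, x):
    """Row-major matrix times vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def rope(vec, pos, base=10000.0):
    """Rotate each (vec[i], vec[i + half]) pair by a position-dependent angle."""
    half = len(vec) // 2
    out = list(vec)
    for i in range(half):
        theta = pos / base ** (2 * i / len(vec))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + half] * s
        out[i + half] = vec[i + half] * c + vec[i] * s
    return out

def attend(q, keys, values):
    """Softmax attention of one query over the cached keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(q))]

def gelu(u):
    """Exact (erf-based) GELU activation."""
    return 0.5 * u * (1.0 + math.erf(u / math.sqrt(2.0)))

def decode_block(x, p, cache, pos):
    """One token through one decoder block; every step is a separate pass."""
    h = layer_norm(x)                                   # LayerNorm
    q, k, v = (matvec(p[name], h) for name in ("wq", "wk", "wv"))  # QKV
    head_dim = len(x) // p["heads"]
    ctx = []
    for i in range(p["heads"]):
        sl = slice(i * head_dim, (i + 1) * head_dim)
        cache[i]["k"].append(rope(k[sl], pos))          # RoPE on the new key
        cache[i]["v"].append(v[sl])
        ctx += attend(rope(q[sl], pos), cache[i]["k"], cache[i]["v"])
    attn_out = matvec(p["wo"], ctx)                     # output projection
    h2 = layer_norm(x)  # assumption: parallel residual, so this reads x
    mlp_out = matvec(p["w_down"], [gelu(u) for u in matvec(p["w_up"], h2)])
    return [a + b + c for a, b, c in zip(x, attn_out, mlp_out)]  # residual
```

In this form, `h`, `q`, `k`, `v`, `ctx`, `attn_out`, `h2`, and the MLP activations are all intermediate tensors. A fused block aims to keep such values on chip instead of writing each one out between kernels.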

📝 Abstract
Large language model (LLM) decoding is latency-sensitive and often bottlenecked by fragmented operator execution and repeated off-chip materialization of intermediate tensors. Prior work expands fusion scope by leveraging thread-block clusters and on-chip inter-block collectives to fuse attention-side operators such as QKV projection, attention, and output projection. We develop ClusterFusion++, a CUDA-level extension that broadens fusion to the full Transformer decoder block for GPT-NeoX/Pythia models: LayerNorm -> QKV -> RoPE -> decode attention -> output projection -> Post-LN -> MLP -> residual. We additionally engineer a CUDA-Graph-compatible execution mode with persistent Tensor Memory Accelerator (TMA) descriptors to reduce per-step overhead. On an NVIDIA RTX 5090-class GPU, ClusterFusion++ improves throughput by 1.34x for Pythia-2.8B and yields similar gains for Pythia-6.9B, while maintaining high output fidelity (near-token-identical generation, with minor non-determinism from FP16 atomics).
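The abstract attributes the remaining non-determinism to FP16 atomics. The sketch below illustrates the general effect only, under the assumption that atomic additions may commit in a different order from run to run: floating-point addition is not associative, so the same partial sums accumulated in a different order can round to a different FP16 result. It emulates FP16 rounding on the CPU with the half-precision format code `e` from Python's `struct` module. The helper names and the input values are illustrative and are not taken from the paper.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp16_sum(values):
    """Accumulate left to right, rounding to FP16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc

# Hypothetical partial sums contributed by different thread blocks.
partials = [2048.0, 0.5, 0.5, 0.5, 0.5]

forward = fp16_sum(partials)             # large value first: each 0.5 is rounded away
backward = fp16_sum(reversed(partials))  # small values first: they add up to 2.0

print(forward, backward)  # 2048.0 2050.0
```

Between 2048 and 4096 adjacent FP16 values are 2 apart, so adding 0.5 to 2048 rounds back to 2048, whereas the four 0.5 terms summed first contribute a representable 2.0. A difference of this kind in a logit can occasionally change which token is selected, which is consistent with generation that is near-token-identical rather than bit-identical.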
Problem

Research questions and friction points this paper is trying to address.

LLM decoding
latency
operator fusion
intermediate tensor materialization
performance bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

cluster-level fusion
full Transformer-block decoding
CUDA Graph
Tensor Memory Accelerator (TMA)
on-chip inter-block collectives