🤖 AI Summary
Operator fragmentation and frequent off-chip memory accesses during LLM decoding cause high latency and memory bandwidth bottlenecks. Method: We propose a novel operator fusion paradigm for GPU clusters, introducing two structured cluster-level communication primitives—ClusterReduce and ClusterGather—that abstract on-chip collective communication across thread blocks, enabling intermediate tensors to remain entirely on-die. Leveraging NVIDIA Hopper’s distributed shared memory and low-latency NVLink, we design a joint communication-computation scheduling framework that fuses QKV projection, attention computation, and output projection into a single kernel for the first time. Results: On H100 GPUs, our approach reduces end-to-end decoding latency by 1.61× on average, significantly outperforming state-of-the-art inference systems. Our core contribution lies in transcending the traditional on-chip boundary of operator fusion by establishing structured communication abstractions and an execution model that enables cross-SM collaboration.
📝 Abstract
Large language model (LLM) decoding suffers from high latency due to fragmented execution across operators and heavy reliance on off-chip memory for data exchange and reduction. This execution model limits opportunities for fusion and incurs significant memory traffic and kernel launch overhead. While modern architectures such as NVIDIA Hopper provide distributed shared memory and low-latency intra-cluster interconnects, they expose only low-level data movement instructions, lacking structured abstractions for collective on-chip communication. To bridge this software-hardware gap, we introduce two cluster-level communication primitives, ClusterReduce and ClusterGather, which abstract common communication patterns and enable structured, high-speed data exchange and reduction between thread blocks within a cluster, allowing intermediate results to remain on-chip without involving off-chip memory. Building on these abstractions, we design ClusterFusion, an execution framework that schedules communication and computation jointly to expand the scope of operator fusion by composing decoding stages such as QKV Projection, Attention, and Output Projection into a single fused kernel. Evaluations on H100 GPUs show that ClusterFusion outperforms state-of-the-art inference frameworks by 1.61× on average in end-to-end latency across different models and configurations. The source code is available at https://github.com/xinhao-luo/ClusterFusion.
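To make the two primitives concrete, here is a minimal host-side simulation of the *semantics* the abstract describes. The real ClusterReduce and ClusterGather run on-GPU over Hopper distributed shared memory; the function names below mirror the paper's primitives, but their signatures and the head-dimension-split example are assumptions made purely for illustration.

```python
# Hypothetical NumPy sketch of the ClusterReduce / ClusterGather semantics.
# In the real system these are on-chip collectives between thread blocks of
# one cluster; here each "block buffer" is just an array partition.
import numpy as np

def cluster_reduce(block_buffers):
    """Sum per-thread-block partial results element-wise, as if each block
    read its peers' shared memory directly (no round trip to HBM)."""
    return np.sum(block_buffers, axis=0)

def cluster_gather(block_buffers):
    """Concatenate per-block fragments into one tensor; the real primitive
    likewise keeps the gathered result on-chip."""
    return np.concatenate(block_buffers, axis=0)

# Example: 4 blocks in a cluster each compute a partial q.k dot product
# over a slice of the head dimension, then reduce across the cluster.
rng = np.random.default_rng(0)
q = rng.standard_normal(128)
k = rng.standard_normal(128)
partials = [q[i * 32:(i + 1) * 32] @ k[i * 32:(i + 1) * 32] for i in range(4)]
score = cluster_reduce(partials)
assert np.isclose(score, q @ k)  # same result as the unpartitioned product

# Example: each block holds one output-projection fragment; gather them.
slices = [rng.standard_normal(8) for _ in range(4)]
full = cluster_gather(slices)
assert full.shape == (32,)
```

The design point this illustrates is why a *structured* primitive helps: fusing QKV projection, attention, and output projection into one kernel requires exactly these reduce/gather exchanges between blocks, which otherwise force a kernel break and a trip through off-chip memory.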