Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems

📅 2026-05-06
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing in-switch computing approaches based on NVLink SHARP (NVLS) overlap communication and computation poorly because their communication mode does not match the memory semantics required by LLM computation kernels, which leaves resources underutilized in multi-GPU systems. This work proposes CAIS, the first compute-aware in-switch computing framework, which aligns communication modes with the memory semantics of LLM computation kernels through three techniques: a compute-aware ISA and microarchitecture extension, a merging-aware thread block (TB) coordination mechanism, and a graph-level dataflow optimizer. By breaking away from the conventional communication-centric design, CAIS achieves a 1.38× average end-to-end training speedup over the state-of-the-art NVLS-enabled solution and 1.61× over T3, a state-of-the-art compute-communication overlap solution that does not use NVLS.
📝 Abstract
Tensor parallelism (TP) in large-scale LLM inference and training introduces frequent collective operations that dominate inter-GPU communication. While in-switch computing, exemplified by NVLink SHARP (NVLS), accelerates collective operations by reducing redundant data transfer, its communication-centric design philosophy introduces a mismatch between its communication mode and the memory semantics required by LLM computation kernels. This mismatch isolates the compute and communication phases, resulting in underutilized resources and limited overlap in multi-GPU systems. To address this limitation, we propose CAIS, the first Compute-Aware In-Switch computing framework, which aligns communication modes with the memory semantics that computation requires. CAIS consists of three integral techniques: (1) a compute-aware ISA and microarchitecture extension to enable compute-aware in-switch computing; (2) merging-aware TB (Thread Block) coordination to improve temporal alignment for efficient request merging; and (3) a graph-level dataflow optimizer to achieve tight cross-kernel overlap. Evaluations on LLM workloads show that CAIS achieves a 1.38$\times$ average end-to-end training speedup over the SOTA NVLS-enabled solution and 1.61$\times$ over T3, the SOTA compute-communication overlap solution that does not leverage NVLS, demonstrating its effectiveness in accelerating TP on multi-GPU systems.
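As a rough illustration of why TP triggers a collective operation after each sharded layer, the following minimal sketch (not taken from the paper) simulates a row-parallel linear layer in plain Python. The helper names `shard_columns`, `shard_rows`, and `all_reduce_sum` are hypothetical, and the "GPUs" are just list entries. Each rank computes a partial product from its slice, and only the element-wise sum across ranks (the all-reduce) recovers the full result.

```python
# Hypothetical illustration: row-parallel linear layer under tensor parallelism.
# Each simulated rank holds a column slice of X and the matching row slice of W.

def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def shard_columns(m, parts):
    """Split matrix m into `parts` equal column slices (one per rank)."""
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def shard_rows(m, parts):
    """Split matrix m into `parts` equal row slices (one per rank)."""
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

def all_reduce_sum(partials):
    """Element-wise sum across ranks; in a real system every rank receives it."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

X = [[1, 2, 3, 4], [5, 6, 7, 8]]        # activations, shape (2, 4)
W = [[1, 0], [0, 1], [2, 1], [1, 2]]    # weights, shape (4, 2)
TP = 2                                  # assumed tensor-parallel degree

# Each rank multiplies its own slices; no rank holds the full answer yet.
partials = [matmul(xs, ws) for xs, ws in zip(shard_columns(X, TP), shard_rows(W, TP))]

# The all-reduce is the communication step that TP adds after the layer.
reduced = all_reduce_sum(partials)
assert reduced == matmul(X, W)
```

In this toy setting the reduction is a local loop. On a multi-GPU system it is the inter-GPU collective that in-switch approaches such as NVLS offload to the switch.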
Problem

Research questions and friction points this paper is trying to address.

Tensor Parallelism
In-Switch Computing
Memory Semantics
Multi-GPU Systems
Compute-Communication Overlap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compute-Aware In-Switch Computing
Tensor Parallelism
NVLink SHARP
Memory Semantics
Cross-Kernel Overlap