Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing in-switch computing approaches based on NVLink SHARP (NVLS) overlap communication and computation poorly because their communication mode does not match the memory semantics that LLM computation kernels require, which leaves resources underutilized in multi-GPU systems. This work proposes CAIS, the first compute-aware in-switch computing framework, which aligns the communication mode with the memory semantics of large language model (LLM) computation kernels through three techniques: a compute-aware instruction set and microarchitectural extensions, a merging-aware thread block coordination mechanism, and a graph-level dataflow optimizer. By moving away from communication-centric designs, CAIS achieves an average end-to-end LLM training speedup of 1.38× over the state-of-the-art NVLS-based approach and 1.61× over T3, a compute-communication overlap baseline that does not use NVLS.

📝 Abstract

Tensor parallelism (TP) in large-scale LLM inference and training introduces frequent collective operations that dominate inter-GPU communication. While in-switch computing, exemplified by NVLink SHARP (NVLS), accelerates collective operations by reducing redundant data transfer, its communication-centric design philosophy introduces a mismatch between its communication mode and the memory-semantics requirements of LLM computation kernels. This mismatch isolates the compute and communication phases, resulting in underutilized resources and limited overlap in multi-GPU systems. To address this limitation, we propose CAIS, the first Compute-Aware In-Switch computing framework, which aligns communication modes with the memory-semantics requirements of computation. CAIS consists of three integral techniques: (1) a compute-aware ISA and microarchitecture extension to enable compute-aware in-switch computing; (2) merging-aware TB (Thread Block) coordination to improve temporal alignment for efficient request merging; and (3) a graph-level dataflow optimizer to achieve tight cross-kernel overlap. Evaluations on LLM workloads show that CAIS achieves a 1.38$\times$ average end-to-end training speedup over the SOTA NVLS-enabled solution and 1.61$\times$ over T3, the SOTA compute-communication overlap solution that does not leverage NVLS, demonstrating its effectiveness in accelerating TP on multi-GPU systems.
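
To make the collective operation mentioned in the abstract concrete, the following is a minimal pure-Python sketch of a row-parallel linear layer under TP. It is not code from the paper and does not model CAIS, NVLS, or any GPU hardware: the lists stand in for per-GPU shards, and the helper names (`shard_columns`, `shard_rows`, `all_reduce_sum`) are hypothetical. It only shows that each shard produces a partial result and that an element-wise sum across shards (an all-reduce) is needed to recover the full output, which is the reduction that in-switch computing moves into the switch.

```python
# Toy model of a row-parallel linear layer under tensor parallelism (TP).
# Each list stands in for one GPU's shard; no real GPUs or NVLink are involved.
# All helper names here are illustrative assumptions, not an API from the paper.

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def shard_columns(m, parts):
    """Split matrix m column-wise into `parts` equal shards."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def shard_rows(m, parts):
    """Split matrix m row-wise into `parts` equal shards."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def all_reduce_sum(partials):
    """Element-wise sum of the per-shard partial outputs (the TP collective)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

X = [[1, 2, 3, 4], [5, 6, 7, 8]]        # activations, shape 2 x 4
W = [[1, 0], [0, 1], [2, 1], [1, 2]]    # weights, shape 4 x 2
NUM_GPUS = 2                            # assumed TP degree for this toy example

x_shards = shard_columns(X, NUM_GPUS)   # each "GPU" holds half of the input columns
w_shards = shard_rows(W, NUM_GPUS)      # and the matching half of the weight rows
partials = [matmul(x, w) for x, w in zip(x_shards, w_shards)]

tp_output = all_reduce_sum(partials)    # communication step between GPUs
reference = matmul(X, W)                # single-device result for comparison

print(tp_output == reference)
```

In this toy model the matmul on every shard must finish before the sum starts, which mirrors the separation of compute and communication phases that the abstract describes as the limitation CAIS targets.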

Problem

Research questions and friction points this paper is trying to address.

Tensor Parallelism

In-Switch Computing

Memory Semantics

Multi-GPU Systems

Compute-Communication Overlap

Innovation

Methods, ideas, or system contributions that make the work stand out.

Compute-Aware In-Switch Computing

Tensor Parallelism

NVLink SHARP