LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication

πŸ“… 2025-11-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the performance bottleneck caused by all-reduce communication in multi-node distributed inference of large language models (LLMs), this paper proposes NVRAR, a hierarchical all-reduce algorithm that integrates NVSHMEM with recursive doubling. NVRAR bypasses NCCL's conventional cross-node communication path, enabling low-latency, high-bandwidth inter-node synchronization on HPE Slingshot and InfiniBand supercomputing infrastructures. Experimental results show that NVRAR achieves 1.9x-3.6x lower communication latency than NCCL for message sizes ranging from 128 KB to 2 MB. For end-to-end batched inference of the Llama 3.1 405B model, it achieves up to a 1.72x reduction in latency. This work represents the first deep integration of NVSHMEM into all-reduce optimization for LLM inference, significantly improving the strong-scaling efficiency of multi-node tensor parallelism.
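The paper's actual implementation relies on NVSHMEM device-initiated communication, which is not reproduced here. As a rough illustration of the hierarchical structure the summary describes, the following Python sketch simulates the three conventional phases of a hierarchical all-reduce: an intra-node reduction to each node's leader, an inter-node all-reduce among leaders (the phase NVRAR accelerates), and an intra-node broadcast. The function name and rank grouping are illustrative assumptions, not the paper's API.

```python
# Simulated hierarchical all-reduce over P ranks grouped into nodes.
# This models the communication pattern only; a real implementation
# would overlap these phases and use GPU-resident buffers.
def hierarchical_allreduce(values, ranks_per_node):
    P = len(values)
    assert P % ranks_per_node == 0, "ranks must fill whole nodes"
    num_nodes = P // ranks_per_node

    # Phase 1: intra-node reduction to each node's leader.
    node_sums = [
        sum(values[n * ranks_per_node:(n + 1) * ranks_per_node])
        for n in range(num_nodes)
    ]

    # Phase 2: inter-node all-reduce among node leaders. This is the
    # step NVRAR replaces with NVSHMEM-based recursive doubling.
    total = sum(node_sums)

    # Phase 3: intra-node broadcast of the final result.
    return [total] * P

print(hierarchical_allreduce([1, 2, 3, 4, 5, 6, 7, 8], ranks_per_node=4))
# every rank ends with 36
```

The key property of this decomposition is that only one rank per node participates in the cross-node phase, so the slow inter-node fabric carries a single message per node rather than one per GPU.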

πŸ“ Abstract
As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Since all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9x-3.6x lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72x reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.
Problem

Research questions and friction points this paper is trying to address.

Addressing communication bottlenecks in multi-node LLM distributed inference
Developing fast hierarchical all-reduce algorithms for model parallelism
Optimizing end-to-end latency for large language models across nodes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical all-reduce algorithm using NVSHMEM
Recursive doubling technique for communication optimization
Multi-node tensor parallelism for distributed inference
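The recursive-doubling technique listed above can be sketched in plain Python. With P = 2^k ranks, step s pairs rank i with partner i XOR 2^s; each pair exchanges and combines partial sums, so after log2(P) steps every rank holds the full reduction. This logarithmic step count is what makes the scheme attractive for the latency-bound message sizes the paper targets. The simulation below is a minimal sketch of the classic algorithm, not the paper's NVSHMEM implementation.

```python
# Recursive-doubling all-reduce, simulated on a list where index = rank.
def recursive_doubling_allreduce(values):
    P = len(values)
    assert P & (P - 1) == 0 and P > 0, "rank count must be a power of two"
    vals = list(values)
    step = 1
    while step < P:
        # In a real implementation all pairwise exchanges in a step run
        # concurrently; snapshotting models that simultaneity.
        snapshot = list(vals)
        for rank in range(P):
            partner = rank ^ step  # exchange with rank differing in bit s
            vals[rank] = snapshot[rank] + snapshot[partner]
        step <<= 1
    return vals

print(recursive_doubling_allreduce([1, 2, 3, 4]))  # every rank ends with 10
```

Each rank sends log2(P) messages of full size, so the pattern trades bandwidth for latency; that trade favors the small-to-medium messages (128 KB to 2 MB) where the paper reports its gains over NCCL.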
Prajwal Singhania
PhD Student, Department of Computer Science, UMD
Computer Science, High-Performance Computing, Computer Systems, AI/ML
Siddharth Singh
Research Scientist at Nvidia
High-Performance Computing, Artificial Intelligence
Lannie Dalton Hough
Department of Computer Science, University of Maryland
Akarsh Srivastava
Department of Computer Science, University of Maryland
Harshitha Menon
Lawrence Livermore National Laboratory
Parallel Computing
C. Jekel
Lawrence Livermore National Laboratory
A. Bhatele
Department of Computer Science, University of Maryland