MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of balancing portability and high performance for communication libraries across heterogeneous AI accelerators (e.g., NVIDIA/AMD GPUs), this paper introduces a novel hierarchical, separation-of-concerns GPU communication abstraction. At the hardware layer, it defines minimal, customizable hardware primitives enabling co-design of software and silicon; at the software layer, it provides a unified, portable API with vendor-specific optimized implementations. Leveraging primitive-driven library design, hardware-aware collective communication optimizations, and cross-vendor interface standardization, the approach significantly reduces redundant development effort across applications. Experiments demonstrate up to 3.8× higher collective communication throughput compared to NCCL, RCCL, and MSCCL, and up to 15% acceleration on real-world AI inference workloads. The framework has been deployed in multiple Microsoft Azure AI services and officially integrated into AMD’s RCCL.

📝 Abstract
Modern cutting-edge AI applications are being developed over fast-evolving, heterogeneous, nascent hardware devices. This requires frequent reworking of the AI software stack to adopt bottom-up changes from new hardware, which takes time for general-purpose software libraries. Consequently, real applications often develop custom software stacks optimized for their specific workloads and hardware. Custom stacks speed up development and optimization, but incur substantial redundant effort across applications in writing non-portable code. This paper discusses an alternative communication library interface for AI applications that offers both portability and performance by reducing redundant effort while maintaining flexibility for customization. We present MSCCL++, a novel abstraction of GPU communication based on separation of concerns: (1) a primitive interface provides a minimal hardware abstraction as a common ground for software and hardware developers to write custom communication, and (2) higher-level portable interfaces and specialized implementations enable optimization for different hardware environments. This approach makes the primitive interface reusable across applications while enabling highly flexible optimization. Compared to state-of-the-art baselines (NCCL, RCCL, and MSCCL), MSCCL++ achieves speedups of up to 3.8× for collective communication and up to 15% for real-world AI inference workloads. MSCCL++ is in production use in multiple AI services provided by Microsoft Azure, and has also been adopted by RCCL, the GPU collective communication library maintained by AMD. MSCCL++ is open-source and available at https://github.com/microsoft/mscclpp.
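The separation of concerns the abstract describes (minimal one-sided primitives below, portable collectives written against them above) can be sketched as a CPU-only mock. The `Channel` type and its `put`/`signal`/`wait` methods here are illustrative stand-ins, not the actual MSCCL++ API: real channels are GPU-side and vendor-tuned, and the two "ranks" below run sequentially in one thread, so the demo splits the send and wait phases that would overlap on real GPUs.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// --- Primitive layer (illustrative): a minimal channel abstraction. ---
// In a real library these map to vendor-specific DMA engines and GPU
// atomics; here they are mocked with memcpy and plain integer flags.
struct Flag { int v = 0; };

struct Channel {
    uint8_t* localBuf;   // this rank's buffer
    uint8_t* remoteBuf;  // peer's buffer (simulated as shared host memory)
    Flag*    sendFlag;   // the peer waits on this
    Flag*    recvFlag;   // this rank waits on this

    // One-sided write of n bytes from a local offset to a remote offset.
    void put(size_t dstOff, size_t srcOff, size_t n) {
        std::memcpy(remoteBuf + dstOff, localBuf + srcOff, n);
    }
    void signal() { sendFlag->v += 1; }                    // notify the peer
    void wait(int expected) { while (recvFlag->v < expected) {} }
};

// --- Portable layer (illustrative): a 2-rank all-gather built only on
// the primitives above, with no knowledge of the underlying hardware. ---
bool run_allgather_demo() {
    const size_t chunk = 4;
    // Each rank owns one chunk of a 2-chunk buffer.
    std::vector<uint8_t> buf0(2 * chunk, 0), buf1(2 * chunk, 0);
    for (size_t i = 0; i < chunk; ++i) { buf0[i] = 0xA0; buf1[chunk + i] = 0xB1; }

    Flag f01, f10;  // f01: rank0 -> rank1 signal; f10: rank1 -> rank0
    Channel ch0{buf0.data(), buf1.data(), &f01, &f10};
    Channel ch1{buf1.data(), buf0.data(), &f10, &f01};

    // Phase 1 (runs concurrently, one per GPU, in practice): send own chunk.
    ch0.put(0, 0, chunk);             ch0.signal();
    ch1.put(chunk, chunk, chunk);     ch1.signal();
    // Phase 2: block until the peer's chunk has arrived.
    ch0.wait(1);
    ch1.wait(1);

    // Both ranks now hold both chunks.
    return buf0[chunk] == 0xB1 && buf1[0] == 0xA0;
}
```

The point of the sketch is the layering: `run_allgather_demo` never touches hardware details, so swapping `Channel`'s internals for a different interconnect leaves the collective unchanged, which is the reuse argument the paper makes for its primitive interface.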
Problem

Research questions and friction points this paper is trying to address.

Reducing redundant efforts in custom AI software stacks
Providing portable GPU communication for diverse hardware
Balancing performance and flexibility in AI communication libraries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Separation of concerns in GPU communication
Primitive interface for hardware abstraction
Higher-level portable interfaces for optimization