NCCLbpf: Verified, Composable Policy Execution for GPU Collective Communication

πŸ“… 2026-03-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses reliability and safety problems in NCCL plugins, which execute unverified native code inside NCCL's address space and are therefore prone to job crashes, silent state corruption, and downtime whenever a policy update forces a restart. The authors propose a userspace eBPF-based runtime that slots into NCCL's existing plugin interface, enabling safe, efficient, and composable communication-policy execution without modifying NCCL itself. Static verification at load time, structured cross-plugin shared maps, and atomic hot-reload together provide policy safety, composability, and zero-downtime updates. On an 8× NVIDIA B300 GPU system, the approach adds only 80–130 ns of overhead per tuner decision, rejects every unsafe plugin behavior tested before it can execute, and improves AllReduce throughput by up to 27% for message sizes from 4 to 128 MiB.
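
To make the "message-size-aware policy" concrete, here is a minimal sketch of what such a tuner program could look like as an eBPF-style C program. The paper does not publish its interface, so the context struct, field names, section name, and thresholds below are all illustrative assumptions, not NCCLbpf's actual API.

```c
/* Hypothetical message-size-aware tuner policy. Compiled with
   clang -target bpf, then statically verified by the runtime at load
   time before it is ever invoked. */
#include <stdint.h>

#define SEC(name) __attribute__((section(name), used))

/* Assumed per-decision context the runtime hands to the policy. */
struct tuner_ctx {
    uint64_t msg_bytes;   /* collective message size in bytes */
    uint32_t algorithm;   /* out: chosen collective algorithm */
    uint32_t protocol;    /* out: chosen transport protocol */
};

enum { ALGO_RING = 0, ALGO_TREE = 1 };

SEC("ncclbpf/tuner")
int pick_algorithm(struct tuner_ctx *ctx)
{
    /* Placeholder thresholds: override the default only inside the
       4-128 MiB band the paper evaluates. */
    if (ctx->msg_bytes >= (4ULL << 20) &&
        ctx->msg_bytes <= (128ULL << 20))
        ctx->algorithm = ALGO_TREE;
    else
        ctx->algorithm = ALGO_RING;
    return 0;
}
```

Because the policy is plain bytecode rather than native code, the verifier can bound its memory accesses and running time at load, which is what lets the runtime reject unsafe plugins before they run.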

πŸ“ Abstract
NCCL is the de facto standard for collective GPU communication in large-scale distributed training, relying heavily on plugins to customize runtime behavior. However, these plugins execute as unverified native code within NCCL's address space, risking job crashes, silent state corruption, and downtime from restarts during policy updates. Inspired by kernel extensibility models, we introduce NCCLbpf, a verified, high-performance extension framework embedding a userspace eBPF runtime directly into NCCL's existing plugin interfaces, without modifying NCCL itself. NCCLbpf offers load-time static verification to prevent unsafe plugin execution, structured cross-plugin maps enabling composable policies and closed-loop adaptation, and atomic policy hot-reloads eliminating the downtime previously required for policy updates. Evaluations on 8× NVIDIA B300 GPUs connected via NVLink demonstrate that NCCLbpf imposes just 80–130 ns overhead per tuner decision (less than 0.03% of collective latency), prevents all tested unsafe plugin behaviors at load time, and enables a message-size-aware eBPF policy that improves AllReduce throughput by up to 27% over NCCL's default in the 4–128 MiB range.
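
The "atomic policy hot-reload" claim comes down to never pausing in-flight collectives while a new program is installed. A minimal sketch of one way to achieve this in the host-side plugin shim is shown below, using a single atomic pointer swap; all names here are illustrative assumptions, not the paper's implementation.

```c
/* Sketch of atomic hot-reload: the shim inside NCCL's plugin hook
   dereferences one atomic pointer per tuner call, so a newly verified
   policy can be swapped in with zero downtime. */
#include <stdatomic.h>
#include <stddef.h>

struct policy {
    int (*run)(void *ctx);   /* entry point of the verified program */
};

static _Atomic(struct policy *) g_active = NULL;

/* Reload path: called only after the new program passes verification. */
void policy_swap(struct policy *verified_new)
{
    /* Release store publishes the fully built policy to readers. */
    atomic_store_explicit(&g_active, verified_new, memory_order_release);
}

/* Fast path: called from the NCCL tuner plugin hook on every decision. */
int policy_invoke(void *ctx)
{
    struct policy *p =
        atomic_load_explicit(&g_active, memory_order_acquire);
    return p ? p->run(ctx) : 0;   /* no policy loaded: NCCL defaults */
}
```

A production shim would also need safe reclamation of the old program (a refcount or epoch scheme) before freeing it; the sketch omits that, but the acquire/release pair is enough to make the swap itself downtime-free.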
Problem

Research questions and friction points this paper addresses.

NCCL, GPU collective communication, unverified plugins, runtime safety, policy updates
Innovation

Methods, ideas, or system contributions that make the work stand out.

eBPF, NCCL, verified execution, composable policies, hot-reload
πŸ”Ž Similar Papers
No similar papers found.