🤖 AI Summary
HPE Slingshot RDMA networks lack native multi-tenancy support and cannot be shared securely in Kubernetes container environments. To address this, the paper proposes the first Slingshot-native multi-tenant RDMA access framework. The approach provides container-granular network isolation by combining a custom CNI plugin, hardware driver optimizations, and Linux namespace-based isolation, enabling fine-grained, zero-trust RDMA resource allocation within Kubernetes. Crucially, it avoids RDMA virtualization overheads, achieving >98% of native RDMA throughput with a <2% latency increase, while scaling to thousands of concurrently accessing Pods. The key contribution is the first secure, high-performance extension of Slingshot RDMA capabilities to cloud-native, multi-tenant converged HPC-Cloud environments, delivering a production-ready network infrastructure for high-performance containerized computing.
📝 Abstract
Converged HPC-Cloud computing is an emerging computing paradigm that aims to support increasingly complex, multi-tenant scientific workflows. These systems must reconcile the isolation requirements of cloud-native workloads with the performance demands of HPC applications. In this context, networking hardware is a critical boundary component: it is the conduit for high-throughput, low-latency communication, and it enforces isolation across tenants. HPE Slingshot is a high-speed network interconnect that provides up to 200 Gbps of throughput per port and targets high-performance computing (HPC) systems. The Slingshot host software, including hardware drivers and network middleware libraries, is designed for HPC deployments, which predominantly use single-tenant access modes. Hence, the Slingshot stack is not suited for secure use in multi-tenant deployments, such as converged HPC-Cloud deployments. In this paper, we design and implement an extension to the Slingshot stack that targets converged deployments based on Kubernetes. Our integration provides secure, container-granular, multi-tenant access to Slingshot RDMA networking capabilities at minimal overhead.