HPC Alongside User-space Kubernetes

📅 2024-06-11
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Traditional high-performance computing (HPC) and cloud computing have long remained siloed due to divergent origins, cultures, and technological trajectories, hindering their joint ability to address emerging heterogeneous scientific workloads demanding both agile service orchestration and ultra-low-latency, state-aware performance. Method: This paper introduces “Converged Computing,” a novel paradigm enabling co-deployment of the HPC workload manager Flux and user-space Kubernetes (Usernetes) on native supercomputing clusters. Leveraging Linux namespaces, cgroups, and a custom network plugin, it establishes an infrastructure-level convergence architecture. Contribution/Results: The approach unifies cloud-native automation and portability with HPC’s low-latency interconnects, high-bandwidth networking, and fine-grained resource scheduling. Experimental evaluation in hybrid environments demonstrates low-overhead execution of HPC applications and efficient cross-environment communication. An open-source, reproducible deployment framework is provided, offering a practical pathway for HPC centers to adopt cloud-native technologies.

📝 Abstract
High performance computing (HPC) and cloud have traditionally been separate, and presented in an adversarial light. The conflict arises from disparate beginnings that led to two drastically different cultures, incentive structures, and communities that are now in direct competition with one another for resources, talent, and speed of innovation. With the emergence of converged computing, a new paradigm of computing has entered the space that advocates for bringing together the best of both worlds from a technological and cultural standpoint. This movement has emerged due to economic and practical needs. Emerging heterogeneous, complex scientific workloads that require an orchestration of services, simulation, and reaction to state can no longer be served by traditional HPC paradigms. However, while cloud offers automation, portability, and orchestration, as it stands now it cannot deliver the network performance, fine-grained resource mapping, or scalability that these same simulations require. These novel requirements call for change not just in workflow software or design, but also in the underlying infrastructure to support them. This is one of the goals of converged computing. While the future of traditional HPC and commercial cloud cannot be entirely known, a reasonable approach to take is one that focuses on new models of convergence, and a collaborative mindset. In this paper, we introduce a new paradigm for compute -- a traditional HPC workload manager, Flux Framework, running seamlessly with a user-space Kubernetes, "Usernetes," to bring a service-oriented, modular, and portable architecture directly to on-premises HPC clusters. We present experiments that assess HPC application performance and networking between the environments, and provide a reproducible setup for the larger community to do exactly that.
Problem

Research questions and friction points this paper is trying to address.

Bridging HPC and cloud computing for converged workloads
Enhancing HPC with cloud-like orchestration and portability
Optimizing network performance in converged Kubernetes-HPC systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Converged Kubernetes on HPC clusters
Flux Framework with user-space Kubernetes
Service-oriented, modular, portable architecture
Vanessa V. Sochat
Lawrence Livermore National Laboratory
David Fox
Lawrence Livermore National Laboratory
Daniel Milroy
Lawrence Livermore National Laboratory
Numerical Analysis, Parallel Computing