Evolving HPC services to enable ML workloads on HPE Cray EX

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited support that traditional HPC systems such as HPE Cray EX offer for dynamic machine learning (ML) workloads, this paper proposes a set of AI-readiness enhancements tailored to the Alps supercomputing platform. Methodologically, it introduces a lightweight ML user environment, node-readiness detection, and rapid performance-screening tools; establishes an observability data product and a service-plane architecture; integrates a dedicated storage system optimized for the GH200 GPUs; and unifies resource monitoring, security-policy enforcement, and service-oriented deployment. The contributions are threefold: (1) significantly improved availability, elasticity, and execution efficiency for ML training and inference tasks on HPC infrastructure; (2) a systematic evolution of HPC systems toward an AI-native paradigm; and (3) a reusable, production-ready technical pathway for AI-HPC convergence on exascale platforms.

📝 Abstract
The Alps Research Infrastructure leverages GH200 technology at scale, featuring 10,752 GPUs. Accessing Alps provides a significant computational advantage for researchers in Artificial Intelligence (AI) and Machine Learning (ML). While Alps serves a broad range of scientific communities, traditional HPC services alone are not sufficient to meet the dynamic needs of the ML community. This paper presents an initial investigation into extending HPC service capabilities to better support ML workloads. We identify key challenges and gaps we have observed since the early-access phase (2023) of Alps by the Swiss AI community and propose several technological enhancements. These include a user environment designed to facilitate the adoption of HPC for ML workloads, balancing performance with flexibility; a utility for rapid performance screening of ML applications during development; observability capabilities and data products for inspecting ongoing large-scale ML workloads; a utility to simplify the vetting of allocated nodes for compute readiness; a service plane infrastructure to deploy various types of workloads, including support and inference services; and a storage infrastructure tailored to the specific needs of ML workloads. These enhancements aim to facilitate the execution of ML workloads on HPC systems, increase system usability and resilience, and better align with the needs of the ML community. We also discuss our current approach to security aspects. This paper concludes by placing these proposals in the broader context of changes in the communities served by HPC infrastructure like ours.
Problem

Research questions and friction points this paper is trying to address.

Extending HPC services to support ML workloads effectively
Addressing gaps in traditional HPC for dynamic ML needs
Enhancing system usability and resilience for AI/ML researchers
Innovation

Methods, ideas, or system contributions that make the work stand out.

User environment for HPC and ML integration
Performance screening utility for ML applications
Storage infrastructure tailored for ML workloads
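The paper only names these utilities without detailing their internals. As a purely illustrative sketch (not the authors' implementation), the node-readiness vetting idea can be pictured as a simple outlier check: run a fixed micro-benchmark on every allocated node and flag nodes that are markedly slower than their peers before the ML job starts. All names and thresholds below are hypothetical.

```python
# Hypothetical node-readiness screen, inspired by (but not taken from)
# the paper: flag allocated nodes whose micro-benchmark timing deviates
# too far from the cohort median before scheduling an ML job onto them.
from statistics import median


def screen_nodes(timings: dict[str, float], slack: float = 1.25):
    """Split nodes into ready/suspect sets.

    timings: node name -> seconds taken by a fixed micro-benchmark
             (e.g. a small GPU matmul or an allreduce); assumed to be
             collected beforehand by a separate launcher.
    slack:   a node is suspect if it is more than `slack` times slower
             than the median node (1.25 is an arbitrary example value).
    """
    baseline = median(timings.values())
    ready = {n for n, t in timings.items() if t <= slack * baseline}
    suspect = set(timings) - ready
    return ready, suspect


# Example: nid0003 is roughly twice as slow as its peers, so it is
# flagged before the workload is placed on it.
timings = {"nid0001": 1.02, "nid0002": 0.98, "nid0003": 2.10, "nid0004": 1.01}
ready, suspect = screen_nodes(timings)
```

The design choice here, comparing against the cohort median rather than an absolute threshold, makes the check robust to hardware generations and benchmark choice, which matters on a heterogeneous machine like Alps.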
Stefano Schuppli
ETH Zurich, Swiss National Supercomputing Centre (CSCS), Lugano, Switzerland
Fawzi Mohamed
ETH Zurich, Swiss National Supercomputing Centre (CSCS), Lugano, Switzerland
Henrique Mendonça
ETH Zurich, Swiss National Supercomputing Centre (CSCS), Lugano, Switzerland
Nina Mujkanovic
ETH Zurich, Swiss National Supercomputing Centre (CSCS), Lugano, Switzerland
Elia Palme
ETH Zurich, Swiss National Supercomputing Centre (CSCS), Lugano, Switzerland
Dino Conciatore
ETH Zurich, Swiss National Supercomputing Centre (CSCS), Lugano, Switzerland
Lukas Drescher
ETH Zurich, Swiss National Supercomputing Centre (CSCS), Lugano, Switzerland
Miguel Gila
ETH Zurich, Swiss National Supercomputing Centre (CSCS), Lugano, Switzerland
Pim Witlox
ETH Zurich, Swiss National Supercomputing Centre (CSCS), Lugano, Switzerland
Joost VandeVondele
Deputy Director for Science, Head of Research Infrastructure Engineering, CSCS, ETH Zurich
High performance computing, simulation and modelling, quantum materials and chemistry
Maxime Martinasso
ETH Zurich, Swiss National Supercomputing Centre (CSCS), Lugano, Switzerland
Thomas C. Schulthess
ETH Zurich, Swiss National Supercomputing Centre (CSCS), Lugano, Switzerland
Torsten Hoefler
Professor of Computer Science at ETH Zurich
High Performance Computing, Deep Learning, Networking, Message Passing Interface, Parallel and Distributed Computing