Beyond Pre-Training: The Full Lifecycle of Foundation Models on HPC Systems

📅 2026-04-14

🤖 AI Summary
National supercomputing centers struggle to support the full lifecycle of foundation models (pre-training, fine-tuning, and inference) efficiently. This work presents a hybrid cloud-native platform, under development at the Swiss National Supercomputing Centre (CSCS), that combines diskless GPU-enabled HPE Cray EX compute nodes with virtualized commodity infrastructure. With Kubernetes as the unified orchestration layer, the platform bridges traditional HPC batch processing and service-oriented AI workflows, targeting an end-to-end “AI Factory” architecture for foundation models within a national supercomputing facility. The authors report initial investigations into fine-tuning pipelines and highly available inference services, with the aim of improving user productivity, and offer the design as a blueprint for integrating end-to-end AI applications into supercomputing centers.

📝 Abstract
Large-scale pre-training of Foundation Models (FMs) constitutes a computationally intensive first phase for enabling AI across diverse scientific and societal applications. This first phase has positioned High-Performance Computing (HPC) facilities as indispensable backbones of "Sovereign AI" initiatives. While the massive throughput requirements of FM pre-training align with the traditional capability-oriented mission of HPC, subsequent phases of the AI lifecycle, typically referred to as fine-tuning and inference, introduce operational paradigms that can conflict with established batch-processing environments. Moreover, these phases are not computationally trivial: they often require substantial high-end compute resources while exhibiting hardware utilization patterns that differ significantly from those of pre-training. This paper addresses the architectural and strategic challenges of operationalizing a complete AI lifecycle within a national supercomputing facility. We present a hybrid cloud-native platform being developed and deployed at the Swiss National Supercomputing Centre (CSCS) that combines diskless GPU-enabled HPE Cray EX compute nodes with virtualized commodity infrastructure. Orchestrated by Kubernetes, this novel service architecture bridges the gap between HPC batch processing and service-oriented workflows. We report our initial investigations into fine-tuning pipelines and highly available inference services, analyzing the associated trade-offs while improving user productivity. Our findings offer a blueprint for enabling supercomputers to integrate "AI Factory" services and workflows, bringing AI innovations into end-to-end scientific and industrial use cases.

Problem

Research questions and friction points this paper is trying to address.

Foundation Models
HPC Systems
Fine-tuning
Inference
AI Lifecycle

Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation models
HPC
cloud-native
Kubernetes
AI lifecycle
Dino Conciatore
Swiss National Supercomputing Centre, ETH Zurich, Lugano, Switzerland
Elia Oggian
Swiss National Supercomputing Centre, ETH Zurich, Lugano, Switzerland
Federico Da Forno
Swiss National Supercomputing Centre, ETH Zurich, Lugano, Switzerland
Stefano Schuppli
Swiss National Supercomputing Centre, ETH Zurich, Lugano, Switzerland
Jerome Tissieres
Swiss National Supercomputing Centre, ETH Zurich, Lugano, Switzerland
Joost VandeVondele
Deputy Director for science, Head of Research Infrastructure Engineering, CSCS, ETH Zurich
high performance computing, simulation and modelling, quantum materials and chemistry
Maxime Martinasso
Swiss National Supercomputing Centre, ETH Zurich, Lugano, Switzerland