Beyond Pre-Training: The Full Lifecycle of Foundation Models on HPC Systems

📅 2026-04-14

🤖 AI Summary

This work addresses the difficulty national supercomputing centers face in supporting the full lifecycle of foundation models (pre-training, fine-tuning, and inference). It presents a hybrid cloud-native platform that combines diskless GPU-accelerated HPE Cray EX nodes with virtualized commodity infrastructure. Using Kubernetes for unified orchestration, the platform bridges traditional HPC batch processing and service-oriented AI workflows, moving a national supercomputing facility toward an end-to-end "AI Factory" architecture for foundation models. The authors report initial investigations into fine-tuning pipelines and highly available inference services, and offer a blueprint for integrating end-to-end AI applications into supercomputing centers while improving user productivity.

📝 Abstract

Large-scale pre-training of Foundation Models (FMs) constitutes a computationally intensive first phase for enabling AI across diverse scientific and societal applications. This first phase has positioned High-Performance Computing (HPC) facilities as indispensable backbones of "Sovereign AI" initiatives. While the massive throughput requirements of FM pre-training align with the traditional capability-oriented mission of HPC, the subsequent phases of the AI lifecycle, typically referred to as fine-tuning and inference, introduce operational paradigms that can conflict with established batch-processing environments. Moreover, these phases are not computationally trivial: they often require substantial high-end compute resources while exhibiting hardware utilization patterns that differ significantly from those of pre-training. This paper addresses the architectural and strategic challenges of operationalizing a complete AI lifecycle within a national supercomputing facility. We present a hybrid cloud-native platform being developed and deployed at the Swiss National Supercomputing Centre (CSCS) that combines diskless GPU-enabled HPE Cray EX compute nodes with virtualized commodity infrastructure. Orchestrated by Kubernetes, this novel service architecture bridges the gap between HPC batch processing and service-oriented workflows. We report our initial investigations into fine-tuning pipelines and highly available inference services, analyzing the associated trade-offs while improving user productivity. Our findings offer a blueprint for enabling supercomputers to integrate "AI Factory" services and workflows, bringing AI innovations into end-to-end scientific and industrial use cases.
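
To make the idea of a Kubernetes-orchestrated, highly available inference service more concrete, here is a minimal sketch of how such a service might be declared. It is an illustration only and not the CSCS platform's actual configuration: the function name `inference_deployment`, the service name `fm-inference`, and the image `registry.example.org/fm-server:latest` are hypothetical, and the GPU resource name `nvidia.com/gpu` assumes a cluster running the NVIDIA device plugin, which the abstract does not confirm.

```python
import json


def inference_deployment(name, image, replicas=2, gpus_per_replica=1):
    """Build a Kubernetes apps/v1 Deployment manifest as a plain dict.

    Running more than one replica lets the service keep answering
    requests if a single pod or node goes away.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "server",
                            "image": image,  # hypothetical image
                            "resources": {
                                # Assumed resource name (NVIDIA device plugin).
                                "limits": {"nvidia.com/gpu": gpus_per_replica}
                            },
                        }
                    ]
                },
            },
        },
    }


manifest = inference_deployment(
    "fm-inference", "registry.example.org/fm-server:latest"
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could in principle be applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML. This long-running, replicated declaration contrasts with a batch job, which is submitted, runs to completion, and releases its nodes.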

Problem

Research questions and friction points this paper is trying to address.

Foundation Models

HPC Systems

Fine-tuning

Inference

AI Lifecycle

Innovation

Methods, ideas, or system contributions that make the work stand out.

Foundation Models

HPC

Cloud-Native