An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience

📅 2026-04-14

🤖 AI Summary
This work addresses the gap in engineering capacity and infrastructure that keeps public research institutions from developing sovereign AI. Using the Alps supercomputer, equipped with NVIDIA GH200 Grace Hopper Superchips, the authors report the pre-training of Apertus, a fully open 70-billion-parameter multilingual large language model, a first for academia at this scale. They describe how they overcame storage bottlenecks and stabilized the large-scale interconnect while turning the supercomputer into a resilient software-defined machine learning platform. The resulting platform is intended to support sustained, iterative work such as fine-tuning foundation models, well beyond a single training run, advancing both open science and sovereign AI capabilities.

📝 Abstract
Large Language Models (LLMs) have surged as a transformative technology for science and society, prompting governments worldwide to pursue sovereign AI capabilities that ensure data compliance and cultural representation. However, the associated capital costs and engineering complexity required to train these models have largely restricted such capabilities to the private sector, leaving a significant gap for public institutions. This paper details the engineering journey behind training Apertus, a fully open multilingual foundation model, on the Alps supercomputer. Representing a first-of-its-kind achievement for academia at the 70B parameter scale, we successfully deployed a massive pre-training campaign on one of Europe's largest systems for open science, powered by NVIDIA GH200 Grace Hopper Superchips. We detail the challenges encountered in readying HPC infrastructure for training AI models, from overcoming storage bottlenecks to stabilizing large-scale interconnects, and the lessons learned in transforming a supercomputer into a resilient software-defined Machine Learning Platform. Finally, we discuss the post-training requirements and evolution of our Machine Learning Platform, outlining how this initial release lays the groundwork for a sustained, iterative operational capability, in particular for fine-tuning foundation models, that extends well beyond a single model training run.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
sovereign AI
HPC infrastructure
engineering complexity
public institutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

large language models
supercomputing
sovereign AI
software-defined ML platform
multilingual foundation model
Jonathan Coles
Swiss National Supercomputing Centre (CSCS)
Stefano Schuppli
Swiss National Supercomputing Centre (CSCS)
Lukas Drescher
Swiss National Supercomputing Centre (CSCS)
Fawzi Roberto Mohamed
Swiss National Supercomputing Centre (CSCS)
Elia Palme
Swiss National Supercomputing Centre (CSCS)
Henrique Mendonça
Swiss National Supercomputing Centre (CSCS)
Miguel Gila
Swiss National Supercomputing Centre (CSCS)
Mark Klein
MIT Center for Collective Intelligence
collective intelligence, artificial intelligence, multi-agent systems, sustainability
Maxime Martinasso
Swiss National Supercomputing Centre (CSCS)
Joost VandeVondele
Deputy Director for science, Head of Research Infrastructure Engineering, CSCS, ETH Zurich
high performance computing, simulation and modelling, quantum materials and chemistry
Torsten Hoefler
Professor of Computer Science at ETH Zurich
High Performance Computing, Deep Learning, Networking, Message Passing Interface, Parallel and Distributed Computing
Thomas Schulthess
Professor of Physics, ETH Zurich; Director, Swiss National Supercomputing Centre (CSCS); Oak Ridge National Laboratory
Computational Physics, Computational Science, Supercomputing, Materials Science, Condensed Matter Physics
Josh Romero
NVIDIA
Igor Gorodetsky
HPE
Ryan Hankins
HPE
Isa Wazirzada
HPE
Martin Jaggi
EPFL
Machine Learning, Optimization
Antoine Bosselut
EPFL
Natural Language Processing, Machine Learning, Commonsense Representation and Reasoning
Imanol Schlag
ETH AI Center
Responsible AI, Large Language Models, Associative RNNs / DeltaNet
Antoni-Joan Solergibert i Llaquet
EPFL
Alejandro Hernández Cano
EPFL
Theofilos Ioannis Manitaras
Nicholas John Browning