An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience

📅 2026-04-14

🤖 AI Summary

This work addresses the gap in engineering capacity and infrastructure that keeps public research institutions from developing sovereign AI. Using the Alps supercomputer, one of Europe's largest systems for open science and equipped with NVIDIA GH200 Grace Hopper Superchips, the authors report pretraining Apertus, a fully open multilingual large language model, which they describe as a first-of-its-kind achievement for academia at the 70-billion-parameter scale. They describe how they turned the supercomputer into a software-defined machine learning platform for large-scale training, including how they overcame storage bottlenecks and stabilized the large-scale interconnect, two of the main engineering obstacles to running AI workloads on high-performance computing systems. The resulting platform is intended to support sustained, iterative work beyond a single training run, in particular fine-tuning of foundation models, and thereby to advance both open science and sovereign AI capabilities.

📝 Abstract

Large Language Models (LLMs) have emerged as a transformative technology for science and society, prompting governments worldwide to pursue sovereign AI capabilities that ensure data compliance and cultural representation. However, the capital costs and engineering complexity required to train these models have largely restricted such capabilities to the private sector, leaving a significant gap for public institutions. This paper details the engineering journey behind training Apertus, a fully open multilingual foundation model, on the Alps supercomputer. Representing a first-of-its-kind achievement for academia at the 70B parameter scale, we successfully deployed a massive pre-training campaign on one of Europe's largest systems for open science, powered by NVIDIA GH200 Grace Hopper Superchips. We detail the challenges encountered in readying HPC infrastructure for training AI models, from overcoming storage bottlenecks to stabilizing large-scale interconnects, and the lessons learned in transforming a supercomputer into a resilient software-defined Machine Learning Platform. Finally, we discuss the post-training requirements and evolution of our Machine Learning Platform, outlining how this initial release lays the groundwork for a sustained, iterative operational capability, in particular for fine-tuning foundation models, that extends well beyond a single model training run.

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

sovereign AI

HPC infrastructure

engineering complexity

public institutions

Innovation

Methods, ideas, or system contributions that make the work stand out.

large language models

supercomputing

sovereign AI