Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite growing interest in Mixture-of-Experts (MoE) models, large-scale end-to-end pretraining of MoE architectures on pure AMD hardware (specifically the MI300X GPU and Pollara interconnect) has remained unexplored, raising questions about the platform's readiness for state-of-the-art LLM training. Method: We introduce MI300X-aware transformer module sizing guidelines, conduct full-stack Pollara communication microbenchmarks, and integrate optimized all-reduce/reduce-scatter primitives, fault-tolerant training, and checkpoint reshaping. Contribution/Results: We complete end-to-end pretraining of the ZAYA1-base MoE model (760M activated parameters, 8.3B total parameters) on native AMD infrastructure. Evaluated on reasoning, mathematics, and code generation, ZAYA1-base outperforms Llama-3-8B and OLMoE and is comparable to Qwen3-4B and Gemma3-12B. These results demonstrate, for the first time, that AMD's hardware and software stack is mature and optimized enough for competitive large-scale MoE pretraining.
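The activated-versus-total parameter split (760M vs. 8.3B) follows standard routed-MoE accounting: every expert's weights are stored, but only the top-k routed experts run for each token. Below is a minimal sketch of that accounting; the layer count, expert count, per-expert size, and routing fan-out are illustrative assumptions chosen to land near the headline figures, not ZAYA1's actual configuration.

```python
# Illustrative routed-MoE parameter accounting. All configuration values are assumptions
# chosen to land near the headline figures; they are NOT ZAYA1's real architecture.
def moe_param_counts(shared_params: int, num_layers: int, num_experts: int,
                     expert_params: int, top_k: int) -> tuple[int, int]:
    """Return (total_params, activated_params) for a simple routed-MoE transformer."""
    expert_total = num_layers * num_experts * expert_params   # every expert is stored
    expert_active = num_layers * top_k * expert_params        # only top-k run per token
    return shared_params + expert_total, shared_params + expert_active

total, active = moe_param_counts(
    shared_params=250_000_000,  # embeddings, attention, routers, norms (assumed)
    num_layers=24,              # assumed
    num_experts=16,             # assumed
    expert_params=21_000_000,   # per-expert MLP parameters (assumed)
    top_k=1,                    # assumed routing fan-out
)
print(f"total ~ {total / 1e9:.1f}B params, activated ~ {active / 1e6:.0f}M params")
# -> total ~ 8.3B params, activated ~ 754M params
```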

📝 Abstract
We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing both MI300X GPUs and the Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara; to our knowledge, this is the first such characterization at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault tolerance and checkpoint reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model, ZAYA1 (an MoE with 760M active and 8.3B total parameters), which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack is mature and optimized enough for competitive large-scale pretraining.
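To make the networking characterization concrete, here is a minimal sketch of the kind of collective microbenchmark described above, covering just the all-reduce case and written against torch.distributed with the NCCL-compatible RCCL backend that PyTorch uses on ROCm. The message-size sweep, iteration counts, and timing scheme are assumptions for illustration, not the paper's actual harness.

```python
# Minimal all-reduce microbenchmark sketch; launch with `torchrun --nproc_per_node=<gpus> bench.py`.
# The sweep range, warm-up counts, and timing scheme are illustrative assumptions.
import os
import time
import torch
import torch.distributed as dist

def bench_all_reduce(num_elems: int, iters: int = 20, warmup: int = 5) -> float:
    """Return the average wall-clock time (seconds) of one all-reduce on a float16 buffer."""
    buf = torch.zeros(num_elems, dtype=torch.float16, device="cuda")  # ROCm exposes HIP GPUs via the "cuda" device type
    for _ in range(warmup):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")  # RCCL is exposed through the nccl backend on ROCm
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    for exp in range(20, 31):                # 2 MiB .. 2 GiB fp16 messages (assumed sweep)
        num_elems = 2 ** exp
        t = bench_all_reduce(num_elems)
        if dist.get_rank() == 0:
            gb = 2 * num_elems / 1e9         # bytes per rank (fp16 = 2 bytes/element)
            print(f"{gb:8.3f} GB  {t * 1e3:8.3f} ms  {gb / t:8.2f} GB/s (algorithm bandwidth)")
    dist.destroy_process_group()
```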
Problem

Research questions and friction points this paper is trying to address.

Whether pure AMD hardware (MI300X GPUs with the Pollara interconnect) is ready for large-scale mixture-of-experts foundation model training
Lack of published cluster and networking characterization and microbenchmarks to guide system design at this scale
Absence of transformer sizing rules that jointly optimize training throughput and inference latency on MI300X
Innovation

Methods, ideas, or system contributions that make the work stand out.

First large-scale MoE pretraining on pure AMD hardware
Comprehensive cluster and networking characterization with microbenchmarks
MI300X-aware transformer sizing rules for optimized model design (see the sketch after this list)
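One plausible form an MI300X-aware sizing rule could take is a divisibility check on the GEMM dimensions produced by the attention and MLP blocks, keeping them aligned to the tile granularity the GPU's GEMM kernels prefer. The sketch below is illustrative only: the 256-element granularity and the set of dimensions checked are assumptions, not the rules derived in the paper.

```python
# Hypothetical sizing check: flag transformer dimensions whose GEMMs would leave
# partially filled tiles. The 256-element granularity and the dimensions checked
# are illustrative assumptions, not the paper's actual MI300X sizing rules.
from dataclasses import dataclass

TILE = 256  # assumed preferred GEMM tile granularity

@dataclass
class BlockDims:
    hidden: int     # model width (d_model)
    head_dim: int   # per-head dimension
    num_heads: int
    ffn_inner: int  # MLP / expert intermediate width

def gemm_dims(d: BlockDims) -> dict:
    """Output dimensions of the main GEMMs in one attention + MLP block."""
    return {
        "qkv_proj": 3 * d.num_heads * d.head_dim,
        "attn_out": d.hidden,
        "ffn_up": d.ffn_inner,
        "ffn_down": d.hidden,
    }

def check(d: BlockDims) -> None:
    for name, dim in gemm_dims(d).items():
        status = "aligned" if dim % TILE == 0 else f"misaligned (pad by {-dim % TILE})"
        print(f"{name:9s} = {dim:6d}  {status}")

check(BlockDims(hidden=4096, head_dim=128, num_heads=32, ffn_inner=14336))
```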
Quentin Anthony
PhD Student, Ohio State University
HPC, Deep Learning, Parallel Computing
Yury Tokpanov
Zyphra
Skyler Szot
Zyphra
Srivatsan Rajagopal
Zyphra
Praneeth Medepalli
Zyphra
Anna Golubeva
Zyphra
Vasu Shyam
Zyphra
Robert Washbourne
Zyphra
Rishi Iyer
Zyphra
Ansh Chaurasia
Zyphra
Tomas Figliolia
Zyphra
Xiao Yang
Zyphra
Drew Thorstensen
IBM
Amartey Pearson
IBM
Zack Grossbart
IBM
Jason van Patten
IBM
Emad Barsoum
AMD, Columbia University
Generative AI, Foundation Models, Agentic AI, Computer Vision, ML Frameworks
Zhenyu Gu
AMD
high performance computing, deep learning, EDA
Yao Fu
AMD
Beren Millidge
Postdoctoral Researcher, University of Oxford