SIGMA: An AI-Empowered Training Stack on Early-Life Hardware

πŸ“… 2025-12-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Early AI accelerators face three critical challenges in large-scale training: hardware unreliability, numerical instability, and low parallel efficiency. To address these, this paper introduces SIGMA, a fully open-source, end-to-end training stack comprising a system-level platform (the LUCIA Training Platform) and a framework-level layer (the LUCIA Training Framework). SIGMA pioneers a fault-tolerant training architecture tailored for pre-production silicon, integrating adaptive fault detection with lightweight recovery, mixed-precision stability calibration, topology-aware All-to-All communication, dynamic micro-batch scheduling, and hardware noise modeling with compensation. It achieves the first stable, long-duration training of a 200B-parameter Mixture-of-Experts (MoE) model across 2,048 early-stage accelerators, delivering 94.45% effective cluster accelerator utilization, only one stability incident over 75 days, a model FLOPs utilization (MFU) of 21.08%, and state-of-the-art accuracy on downstream tasks.
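To make the reliability mechanism above concrete, here is a minimal sketch of heartbeat-based fault detection paired with lightweight checkpoint recovery. It is illustrative only: the heartbeat interval, the miss threshold, and the `cordon`/`restart_from` hooks are assumptions for this sketch, not LTP's actual implementation.

```python
import time
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 10.0  # assumed probe period
MISSED_BEATS_LIMIT = 3       # assumed threshold before a node is marked failed

@dataclass
class NodeState:
    last_beat: float = field(default_factory=time.monotonic)
    missed: int = 0

class FaultMonitor:
    """Tracks per-node heartbeats and flags nodes whose beats have lapsed."""

    def __init__(self, node_ids):
        self.nodes = {nid: NodeState() for nid in node_ids}

    def record_heartbeat(self, node_id):
        state = self.nodes[node_id]
        state.last_beat = time.monotonic()
        state.missed = 0

    def failed_nodes(self):
        now = time.monotonic()
        failed = []
        for nid, state in self.nodes.items():
            if now - state.last_beat > HEARTBEAT_INTERVAL_S:
                state.missed += 1
                state.last_beat = now  # count each lapsed interval once
            if state.missed >= MISSED_BEATS_LIMIT:
                failed.append(nid)
        return failed

def recover(job, failed):
    """Lightweight recovery: cordon failed nodes and resume from the latest
    checkpoint instead of restarting the whole job from scratch.
    `cordon` and `restart_from` are hypothetical scheduler/framework hooks."""
    job.cordon(failed)
    job.restart_from(job.latest_checkpoint)

# Usage: nodes that stop reporting eventually show up in failed_nodes().
monitor = FaultMonitor(node_ids=range(4))
monitor.record_heartbeat(0)
print(monitor.failed_nodes())  # [] until heartbeats lapse past the threshold
```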

πŸ“ Abstract
An increasing variety of AI accelerators is being considered for large-scale training. However, enabling large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that threaten correctness and convergence; and the complexity of parallelism optimization, combined with unpredictable local noise, that degrades efficiency. To address these challenges, SIGMA is an open-source training stack designed to improve the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware. At its core is the LUCIA TRAINING PLATFORM (LTP), a system optimized for clusters of early-life AI accelerators. Since its launch in March 2025, LTP has significantly improved training reliability and operational productivity: over the past five months it has achieved 94.45% effective cluster accelerator utilization while substantially reducing node-recycling and job-recovery times. Building on LTP, the LUCIA TRAINING FRAMEWORK (LTF) trained SIGMA-MOE, a 200B MoE model, using 2,048 AI accelerators, delivering 21.08% MFU, state-of-the-art downstream accuracy, and only one stability incident over a 75-day period. Together, these advances establish SIGMA not only as a solution to the critical challenges of large-scale training but also as a new benchmark for AI infrastructure and platform innovation, offering a robust, cost-effective alternative to prevailing accelerator stacks and significantly advancing AI capabilities and scalability. The source code of SIGMA is available at https://github.com/microsoft/LuciaTrainingPlatform.
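For readers unfamiliar with the MFU metric cited above: MFU is the ratio of the model FLOPs a job actually sustains to the cluster's theoretical peak, while effective cluster utilization is, roughly, the fraction of accelerator time spent on productive training rather than lost to failures and recovery. A minimal sketch of the MFU calculation follows; all numeric inputs in the example are hypothetical placeholders, not figures from the paper.

```python
def model_flops_utilization(flops_per_token, tokens_per_second,
                            num_accelerators, peak_flops_per_accelerator):
    """MFU = achieved model FLOPs/s divided by theoretical peak FLOPs/s."""
    achieved = flops_per_token * tokens_per_second
    peak = num_accelerators * peak_flops_per_accelerator
    return achieved / peak

# Hypothetical example (placeholder numbers, not from the paper):
mfu = model_flops_utilization(
    flops_per_token=1.2e12,             # assumed forward+backward FLOPs per token
    tokens_per_second=7.0e5,            # assumed sustained cluster throughput
    num_accelerators=2048,
    peak_flops_per_accelerator=2.0e15,  # assumed peak for an early-life chip
)
print(f"MFU = {mfu:.2%}")
```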
Problem

Research questions and friction points this paper is trying to address.

Addressing reliability issues from frequent system disruptions and undefined failure modes
Resolving numerical errors and training instabilities that threaten convergence
Overcoming parallelism optimization complexity with unpredictable local noise degrading efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source, end-to-end training stack for reliable training on early-life AI hardware
LUCIA Training Platform (LTP) sustains high effective accelerator utilization across the cluster
LUCIA Training Framework (LTF) trains a 200B MoE model with stability and efficiency (see the All-to-All locality sketch after this list)
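As a rough illustration of the topology-aware All-to-All communication credited in the AI summary, the sketch below splits a rank's All-to-All payload by node locality, so intra-node traffic can use the faster local fabric while only the remainder crosses the cluster network. The node size and the split policy are assumptions for illustration, not the paper's algorithm.

```python
# Minimal sketch of locality-aware traffic splitting for MoE All-to-All.
# Assumption: 8 accelerators per node; illustrative policy only.

ACCELERATORS_PER_NODE = 8  # assumed node size

def node_of(rank: int) -> int:
    return rank // ACCELERATORS_PER_NODE

def split_alltoall_bytes(send_bytes: dict, my_rank: int):
    """Partition this rank's All-to-All payload into intra-node traffic
    (cheap, over the node's local fabric) and inter-node traffic
    (expensive, over the cluster network)."""
    intra = {r: b for r, b in send_bytes.items() if node_of(r) == node_of(my_rank)}
    inter = {r: b for r, b in send_bytes.items() if node_of(r) != node_of(my_rank)}
    return intra, inter

# Usage: rank 3 sends to ranks 1, 5, 9, and 20; ranks 1 and 5 share its node.
intra, inter = split_alltoall_bytes({1: 4096, 5: 4096, 9: 2048, 20: 2048}, my_rank=3)
print(f"intra-node bytes: {sum(intra.values())}, inter-node bytes: {sum(inter.values())}")
```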
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors

Lei Qu (Microsoft Research)
Lianhai Ren (Microsoft Research)
Peng Cheng (Microsoft Research)
Rui Gao (Microsoft Research)
Ruizhe Wang (University of Waterloo)
Tianyu Chen (Microsoft Research)
Xiao Liu (Microsoft Research)
Xingjian Zhang (Microsoft Research)
Yeyun Gong (Microsoft Research Asia)
Yifan Xiong (Microsoft Research)
Yucheng Ding (Shanghai Jiao Tong University)
Yuting Jiang (Microsoft Research)
Zhenghao Lin (Microsoft Research Asia)
Zhongxin Guo (Microsoft Research)
Ziyue Yang (University of Rochester)