🤖 AI Summary
Early-life AI accelerators face three critical challenges in large-scale training: hardware unreliability, numerical instability, and low parallel efficiency. To address these, this paper introduces SIGMA, a fully open-source, end-to-end training stack comprising a system-level platform and a framework-level layer. SIGMA pioneers a fault-tolerant training architecture tailored for pre-production silicon, integrating adaptive fault detection and lightweight recovery, mixed-precision stability calibration, topology-aware All-to-All communication, dynamic micro-batch scheduling, and hardware noise modeling with compensation. It achieves the first stable, long-duration training of a 200B-parameter Mixture-of-Experts (MoE) model across 2,048 early-stage accelerators, delivering 94.45% effective cluster utilization, only one stability incident over 75 days, a model FLOPs utilization (MFU) of 21.08%, and state-of-the-art accuracy on downstream tasks.
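As a rough illustration of the two headline metrics, the following sketch computes MFU and effective cluster utilization in the way these quantities are conventionally defined (achieved model FLOP/s over aggregate peak FLOP/s, and productive accelerator-hours over total accelerator-hours). The function names and the numbers in the usage comment are illustrative assumptions, not values taken from the paper.

```python
def mfu(model_flops_per_token: float, tokens_per_second: float,
        num_accelerators: int, peak_flops_per_accelerator: float) -> float:
    """Model FLOPs utilization: fraction of the cluster's peak FLOP/s
    actually spent on model computation (conventional definition)."""
    achieved_flops_per_sec = model_flops_per_token * tokens_per_second
    peak_flops_per_sec = num_accelerators * peak_flops_per_accelerator
    return achieved_flops_per_sec / peak_flops_per_sec

def effective_utilization(productive_accelerator_hours: float,
                          total_accelerator_hours: float) -> float:
    """Share of accelerator-hours not lost to failures, restarts, or idling."""
    return productive_accelerator_hours / total_accelerator_hours

# Hypothetical numbers for illustration only: 1.2e12 FLOPs per token,
# 1,000 tokens/s, 2,048 accelerators at an assumed 300 TFLOP/s peak each.
print(mfu(1.2e12, 1_000.0, 2048, 300e12))      # ≈ 0.00195
print(effective_utilization(94.45, 100.0))     # ≈ 0.9445
```

For an MoE model, `model_flops_per_token` would count only the FLOPs of the experts actually activated per token, which is why MoE MFU figures are computed against active rather than total parameters.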
📝 Abstract
An increasing variety of AI accelerators is being considered for large-scale training. However, enabling large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that threaten correctness and convergence; and the complexity of parallelism optimization combined with unpredictable local noise that degrades efficiency. To address these challenges, we present SIGMA, an open-source training stack designed to improve the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware. The core of this initiative is the LUCIA TRAINING PLATFORM (LTP), a system optimized for clusters of early-life AI accelerators. Since its launch in March 2025, LTP has significantly enhanced training reliability and operational productivity. Over the past five months, it has sustained 94.45% effective cluster accelerator utilization while substantially reducing node recycling and job-recovery times. Building on LTP, the LUCIA TRAINING FRAMEWORK (LTF) successfully trained SIGMA-MOE, a 200B MoE model, on 2,048 AI accelerators. This effort delivered strong stability and efficiency outcomes: 21.08% model FLOPs utilization (MFU), state-of-the-art downstream accuracy, and only one stability incident over a 75-day period. Together, these advances position SIGMA not only as a solution to the critical challenges of large-scale training but also as a new benchmark for AI infrastructure and platform innovation, offering a robust, cost-effective alternative to established accelerator stacks and significantly advancing AI capabilities and scalability. The source code of SIGMA is available at https://github.com/microsoft/LuciaTrainingPlatform.