🤖 AI Summary
Reliability of large-scale ML training clusters is increasingly critical, yet how failure impact scales with job size remains poorly understood. This paper analyzes 11 months of operational data from two multi-tenant clusters—spanning 4 million ML training jobs and over 150 million A100 GPU-hours—to establish a fine-grained failure taxonomy and a suite of reliability metrics tailored to ML workloads. We propose the Effective Training Time Ratio (ETTR) and model it as a function of job parameters, enabling quantitative pre-evaluation of software-based fault-mitigation strategies. Through MTTF fitting and workload characterization, we show that while large jobs are the most vulnerable to failures, smaller jobs constitute the vast majority of the workload and must therefore also be included in optimization objectives. Our contributions include a failure model that projects Mean Time to Failure across GPU scales and a methodology for reliability assessment at scale—providing empirical foundations and design guidance for reliability-aware large-scale ML infrastructure.
📝 Abstract
Reliability is a fundamental challenge in operating large-scale machine learning (ML) infrastructures, particularly as the scale of ML models and training clusters continues to grow. Despite decades of research on infrastructure failures, the impact of job failures across different scales remains unclear. This paper presents a view of managing two large, multi-tenant ML clusters, providing quantitative analysis, operational experience, and our own perspective in understanding and addressing reliability concerns at scale. Our analysis reveals that while large jobs are most vulnerable to failures, smaller jobs make up the majority of jobs in the clusters and should be incorporated into optimization objectives. We identify key workload properties, compare them across clusters, and demonstrate essential reliability requirements for pushing the boundaries of ML training at scale. We introduce a taxonomy of failures and key reliability metrics, and analyze 11 months of data from two state-of-the-art ML environments comprising 4 million jobs and over 150 million A100 GPU-hours. Building on our data, we fit a failure model to project Mean Time to Failure for various GPU scales. We further propose a method to estimate a related metric, Effective Training Time Ratio, as a function of job parameters, and we use this model to gauge the efficacy of potential software mitigations at scale. Our work provides valuable insights and future research directions for improving the reliability of AI supercomputer clusters, emphasizing the need for flexible, workload-agnostic, and reliability-aware infrastructure, system software, and algorithms.
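To make the abstract's modeling idea concrete, here is a minimal sketch of how an ETTR-style estimate can be derived from job parameters. This is an illustrative first-order approximation under a standard periodic-checkpointing model, not the paper's exact formulation; the per-GPU MTTF, checkpoint interval, and restart overhead values are hypothetical assumptions.

```python
def cluster_mttf(per_gpu_mttf_hours: float, num_gpus: int) -> float:
    """Project cluster-level MTTF assuming independent, identically
    distributed GPU failures (exponential failure model)."""
    return per_gpu_mttf_hours / num_gpus


def estimate_ettr(num_gpus: int,
                  per_gpu_mttf_hours: float = 50_000.0,   # hypothetical
                  checkpoint_interval_hours: float = 1.0,  # hypothetical
                  restart_overhead_hours: float = 0.25) -> float:
    """Rough Effective Training Time Ratio: productive time / wallclock.

    Assumes periodic checkpointing, so on average half a checkpoint
    interval of work is lost per failure, plus a fixed restart cost.
    Valid only when the expected loss per failure is small relative
    to the cluster MTTF.
    """
    mttf = cluster_mttf(per_gpu_mttf_hours, num_gpus)
    # Expected unproductive time per failure cycle.
    lost_per_failure = checkpoint_interval_hours / 2 + restart_overhead_hours
    return max(0.0, 1.0 - lost_per_failure / mttf)
```

Under these assumed parameters, the model reproduces the qualitative trend the paper describes: ETTR degrades as the GPU count grows, since cluster-level MTTF shrinks inversely with scale, which is what makes software mitigations (cheaper checkpoints, faster restarts) increasingly valuable for large jobs.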