AIReSim: A Discrete Event Simulator for Large-scale AI Cluster Reliability Modeling

2026-03-07
Citations: 0
Influential: 0
AI Summary
Large-scale AI clusters often suffer from reduced utilization due to failures, and recovery is costly, so the effect of fault-tolerance and scheduling configurations on system reliability needs systematic evaluation. To address this, this work proposes AIReSim, the first discrete-event simulation framework tailored for configurable large-scale AI clusters. AIReSim integrates modules for failure modeling, checkpoint-based recovery, job scheduling, and resource repair, enabling joint parameter tuning, sensitivity analysis, and exploration of "what-if" scenarios. Case studies demonstrate that AIReSim identifies the critical parameters that should guide capacity planning and system optimization, improving resource efficiency while maintaining reliability.

Abstract
Failures in clusters running large-scale AI workloads can decrease utilization. Because the cost of a failure in such workloads is high (it requires restarting the entire job from a previous checkpoint), many mechanisms exist to mitigate failures and minimize their impact. However, these mechanisms have many knobs and parameters, all of which must be carefully tuned to the characteristics of the system and cluster. We built AIReSim, a discrete event simulator, to evaluate the design choices in the failure, recovery, scheduling, and repair processes of a cluster running a large-scale AI workload. AIReSim allows the system designer to systematically evaluate the effects of the different knobs and parameters on the overall end-to-end reliability of the system. Further, AIReSim can be used to identify which knobs or parameters matter most, in order to prioritize the investment of effort in improving the system. AIReSim also allows tuning the knobs to achieve different tradeoffs in the system and supports exploring various "what-if" scenarios. We present a case study of applying AIReSim to capacity planning for large-scale clusters running AI workloads.
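
To make the idea of discrete event simulation with checkpoint-based recovery concrete, here is a minimal Python sketch. It is not AIReSim's code or API. The function `simulate_job` and all of its parameters are hypothetical names chosen for illustration. The sketch assumes exponentially distributed failures at a cluster-wide rate of `num_nodes / node_mtbf_hours`, zero-cost checkpoints, a fixed restart delay during which no failure occurs, and no modeling of scheduling, spares, or repair.

```python
import heapq
import itertools
import random


def simulate_job(work_hours, num_nodes, node_mtbf_hours,
                 checkpoint_interval_hours, restart_hours, seed=0):
    """Simulate one job; return (wall_clock_hours, failure_count)."""
    rng = random.Random(seed)
    failure_rate = num_nodes / node_mtbf_hours  # assumed cluster-wide failures per hour
    seq = itertools.count()  # tie-breaker so heap entries never compare event kinds
    events = []              # min-heap of (time, seq, kind, epoch)
    saved = 0.0              # progress (hours) captured by the last checkpoint
    epoch = 0                # incremented on failure to invalidate stale events
    failures = 0

    def schedule(time, kind):
        heapq.heappush(events, (time, next(seq), kind, epoch))

    schedule(0.0, "start")
    while events:
        now, _, kind, event_epoch = heapq.heappop(events)
        if event_epoch != epoch:
            continue  # event belongs to a run that a failure already aborted
        if kind == "start":
            # Resume from the last checkpoint and schedule what can happen next.
            schedule(now + (work_hours - saved), "finish")
            schedule(now + rng.expovariate(failure_rate), "failure")
            schedule(now + checkpoint_interval_hours, "checkpoint")
        elif kind == "checkpoint":
            saved += checkpoint_interval_hours
            schedule(now + checkpoint_interval_hours, "checkpoint")
        elif kind == "failure":
            failures += 1
            epoch += 1  # drops the pending finish and checkpoint events
            schedule(now + restart_hours, "start")
        elif kind == "finish":
            return now, failures
    raise RuntimeError("event queue drained before the job finished")
```

Calling `simulate_job` over a range of `checkpoint_interval_hours` values, and averaging over several seeds, gives a toy version of the kind of knob sweep the abstract describes: shorter intervals lose less work per failure, while a real model would also charge a cost for each checkpoint.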
Problem

Research questions and friction points this paper is trying to address.

AI cluster reliability
failure recovery
discrete event simulation
large-scale AI workloads
system parameter tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete event simulation
AI cluster reliability
failure recovery
system parameter tuning
capacity planning