From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

career value

224K/year

🤖 AI Summary

This study addresses the challenges of hardware failures and multi-node storage I/O bottlenecks in large-scale LLM pretraining, which are rarely exposed in small-scale tests yet critically undermine training stability and recovery efficiency. Leveraging 55 days of monitoring data and 73 days of operational logs from a 504-GPU cluster, this work presents the first analysis—within a cross-organizational shared monitoring framework—of NFS RPC-layer saturation unique to large-scale training. The authors propose a multi-signal fusion fault detection strategy coupled with an automated retry mechanism. Experimental results demonstrate 100% GPU fault detection accuracy with a low false-positive rate of approximately 0.84 incidents per day, and an automated retry success rate of 33.3%, which is 2.7 times higher than manual recovery, substantially enhancing system observability and fault tolerance.

📝 Abstract

Large-scale AI training is now fundamentally a distributed systems problem, and hardware failures have become routine operating conditions rather than rare exceptions. Public operational evidence from production training clusters, however, remains scarce. This technical report presents an empirical analysis of a 63-node NVIDIA B200 production cluster (504 GPUs), using 55 days of Prometheus time-series data and 73 days of operational logs covering 224 multi-node training sessions. The cluster operates within a cross-organizational environment in which five parties (SKT, Upstage, Lablup, NVIDIA Korea, and VAST Data) share a unified monitoring pipeline. This arrangement enabled joint diagnosis of a 60-node-scale storage I/O bottleneck that did not appear at 2-4-node scale, a production-scale phenomenon no single team could isolate alone. Drawing on a months-long pre-training campaign, we perform three quantitative analyses yielding four findings. First, statistical analysis over 751 Prometheus metrics and 10 XID-identified GPU failures achieves a 10/10 detection rate (2/10 pre-XID) at ~0.84 false positives per day. No single metric is consistently dominant across failure types, motivating a multi-signal detection strategy. Second, profiling 523 checkpoint events along the GPU VRAM to NFS path attributes the "bandwidth paradox" (1.4-10.4% utilization of 200 Gbps RoCE) to saturation of the 128-slot NFS RPC layer. Third, multi-node failure response shows concentrated exclusions (top 3 of 63 nodes account for >50% of all exclusions) and an auto-retry chain success rate of 33.3% over 12 chains (73 attempts), 2.7x the 12.5% manual recovery rate; the median retry interval is 11 min (IQR 10-11). All analyses are grounded in production infrastructure providing session-level workload management, GPU-centric scheduling, and unified observability.

Problem

Research questions and friction points this paper is trying to address.

hardware failures

distributed training

storage I/O bottleneck

fault detection

multi-node recovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-signal failure detection

storage I/O bottleneck

NFS RPC saturation