🤖 AI Summary
In large-scale distributed AI training, frequent hardware failures and communication contention lead to GPU underutilization and increased latency. To address these challenges, this paper proposes C4, a communication-driven framework. Its core contributions are twofold: (1) a rapid hardware-anomaly detection and isolation mechanism that exploits the distinctive syndromes faults leave in the periodic, homogeneous patterns of collective communication; and (2) a deterministic traffic-orchestration method tailored to the small number of long-lived communication flows, which reduces bandwidth contention among them. Combined with fault-tolerant task restarts, C4 has been deployed in a hyperscale cloud production environment. Results show a 30–45% improvement in training efficiency, attributed to a 30% reduction in failure-related overhead and a 15% reduction in communication cost.
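The first mechanism above rests on a simple observation: because every rank does near-identical work between synchronization points, a faulty component shows up as a timing outlier in collective operations. A minimal sketch of that idea (the function name, threshold factor, and timing data are all hypothetical illustrations, not C4's actual implementation):

```python
import statistics

def find_suspect_ranks(durations, factor=3.0):
    """Flag ranks whose median collective-op completion time is far above
    the cluster-wide median -- a rough proxy for the 'syndrome' a faulty
    component leaves in otherwise homogeneous collective communication.

    durations: {rank: [per-iteration completion times]}
    factor: hypothetical outlier threshold relative to the cluster median.
    """
    medians = {rank: statistics.median(ts) for rank, ts in durations.items()}
    cluster_median = statistics.median(medians.values())
    return sorted(r for r, m in medians.items() if m > factor * cluster_median)

# Hypothetical per-rank all-reduce timings (ms) over 5 iterations:
timings = {
    0: [10.1, 10.3, 9.9, 10.0, 10.2],
    1: [10.0, 10.1, 10.2, 9.8, 10.1],
    2: [95.0, 97.2, 96.1, 98.4, 95.5],  # straggler: likely faulty GPU or link
    3: [10.2, 10.0, 10.1, 10.3, 9.9],
}
print(find_suspect_ranks(timings))  # → [2]
```

A real system would of course distinguish a slow rank from a slow link and act on more signals than timing alone; this only illustrates why homogeneous, iteration-structured load makes fast localization feasible.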
📝 Abstract
The emergence of Large Language Models (LLMs) has necessitated the adoption of distributed training techniques, involving the deployment of thousands of GPUs to train a single model. Unfortunately, the efficiency of large-scale distributed training systems is often suboptimal due to the increased likelihood of hardware errors in high-end GPU products and the heightened risk of network traffic collisions. Specifically, GPUs involved in the same job require periodic synchronization to exchange necessary data, such as gradients, parameters, or activations. As a result, any local hardware failure can disrupt training tasks, and the inability to swiftly identify faulty components leads to a significant waste of GPU resources. Moreover, prolonged communication due to traffic collisions can substantially increase GPU waiting times. To address these challenges, we propose a communication-driven solution, namely C4. The key insights of C4 are twofold. First, the load in distributed training is homogeneous and is divided into iterations by periodic synchronization, so hardware anomalies induce distinctive syndromes in collective communication. By leveraging this feature, C4 can rapidly identify faulty components, swiftly isolate the anomaly, and restart the task, thereby avoiding the resource wastage caused by delayed anomaly detection. Second, the predictable communication model of collective communication, involving a limited number of long-lived flows, allows C4 to efficiently plan traffic, substantially reducing bandwidth competition among these flows. C4 has been extensively deployed across real-world production systems at a hyperscale cloud provider, yielding a significant improvement in system efficiency, ranging from 30% to 45%. This enhancement is attributed to a 30% reduction in error-induced overhead and a 15% reduction in communication costs.
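The second insight (a small, stable set of long-lived flows makes traffic planning tractable) can be illustrated with a toy deterministic placement: sort flows and greedily assign each to the least-loaded path, instead of relying on random hashing that may collide. The flow names, bandwidths, and path labels below are hypothetical, and this greedy scheme is only a stand-in for C4's actual orchestration:

```python
def plan_flows(flows, paths):
    """Deterministic placement for a known set of long-lived flows.

    flows: {flow_name: bandwidth demand}
    paths: list of candidate path names
    Sort flows by descending demand (ties broken by name for determinism)
    and assign each to the currently least-loaded path.
    """
    load = {p: 0.0 for p in paths}
    placement = {}
    for flow, bw in sorted(flows.items(), key=lambda kv: (-kv[1], kv[0])):
        best = min(paths, key=lambda p: (load[p], p))
        placement[flow] = best
        load[best] += bw
    return placement, load

# Four hypothetical 40-unit flows over two uplinks:
flows = {"f0": 40.0, "f1": 40.0, "f2": 40.0, "f3": 40.0}
paths = ["uplink-A", "uplink-B"]
placement, load = plan_flows(flows, paths)
# Balanced: 80.0 per uplink, whereas an unlucky hash could pile 120+ onto one.
```

The point is not the particular heuristic: because the flows are few, long-lived, and known in advance from the collective-communication pattern, the network can compute such a plan once per job rather than reacting to transient congestion.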