Minder: Faulty Machine Detection for Large-scale Distributed Model Training

📅 2024-11-04
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address training interruptions caused by frequent hardware faults in large-scale distributed model training, this paper proposes Minder, a lightweight detector that identifies faulty machines from their monitoring time series. Targeting training tasks that span up to thousands of machines, it provides an end-to-end detection framework that combines multi-dimensional monitoring time-series analysis, anomaly pattern clustering, and dynamic thresholding to give fine-grained, low-latency early warning of faults in distributed training tasks. Compared with manual inspection and generic anomaly detection approaches, Minder reacts to faults within seconds (3.6 seconds on average), localizes faulty machines accurately (precision: 0.904; F1-score: 0.893), and remains robust across heterogeneous failure modes. Deployed in production for over one year, it has improved training continuity and operational efficiency.

📝 Abstract
Large-scale distributed model training requires simultaneous training on up to thousands of machines. Faulty machine detection is critical when an unexpected fault occurs in a machine. From our experience, a training task can encounter two faults per day on average, possibly leading to a halt for hours. To address the drawbacks of time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks. The key idea of Minder is to automatically and efficiently detect the distinctive monitoring metric patterns of a faulty machine, which can persist for a period before the entire training task comes to a halt. Minder has been deployed in our production environment for over one year, monitoring daily distributed training tasks that each involve up to thousands of machines. In our real-world fault detection scenarios, Minder can accurately and efficiently react to faults within 3.6 seconds on average, with a precision of 0.904 and an F1-score of 0.893.
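
The key idea in the abstract can be illustrated with a small sketch. This is not Minder's implementation, which this page does not describe in detail. It is a minimal peer-comparison example under stated assumptions: a single monitoring metric sampled on every machine, a sliding window, a robust (median/MAD) outlier score, and a requirement that the same machine stays an outlier for several consecutive windows. The function name `find_faulty_machine` and the parameters `window`, `z_threshold`, and `min_consecutive` are hypothetical.

```python
import numpy as np

def find_faulty_machine(metric, window=5, z_threshold=3.0, min_consecutive=3):
    """Return the index of the first machine whose metric pattern deviates
    from its peers for `min_consecutive` sliding windows in a row, else None.

    metric: array of shape (num_machines, num_timesteps) for ONE metric.
    All thresholds are illustrative assumptions, not values from the paper.
    """
    metric = np.asarray(metric, dtype=float)
    num_machines, num_steps = metric.shape
    streak = np.zeros(num_machines, dtype=int)  # consecutive outlier windows per machine

    for start in range(num_steps - window + 1):
        win = metric[:, start:start + window]          # (machines, window)
        typical = np.median(win, axis=0)               # peer behaviour at each time step
        dist = np.linalg.norm(win - typical, axis=1)   # each machine's distance from peers

        # Robust z-score of the distances (median / MAD instead of mean / std).
        med = np.median(dist)
        mad = np.median(np.abs(dist - med)) + 1e-9     # small constant avoids division by zero
        score = (dist - med) / (1.4826 * mad)

        # Extend the streak for outliers, reset it for everyone else.
        streak = np.where(score > z_threshold, streak + 1, 0)
        if streak.max() >= min_consecutive:
            return int(streak.argmax())
    return None

# Synthetic example: 16 machines share a load pattern; machine 7 collapses at t = 30.
t = np.arange(60)
offsets = 0.01 * np.arange(16)[:, None]                # small per-machine differences
metric = 50.0 + np.sin(t / 5.0) + offsets
metric[7, 30:] = 5.0
suspect = find_faulty_machine(metric)
print("suspected faulty machine:", suspect)
```

Requiring several consecutive outlier windows reflects the abstract's observation that a faulty machine's pattern lasts for a period before the task halts, and it keeps a single noisy sample from raising an alert.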
Problem

Research questions and friction points this paper is trying to address.

Detects faulty machines in large-scale distributed model training
Reduces manual scrutiny by automating fault detection
Identifies distinctive metric patterns before training halts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically detects faulty machine metric patterns
Monitors large-scale distributed training tasks
Achieves high precision and fast fault reaction
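
The reported precision and F1-score together imply a recall, because F1 is the harmonic mean of precision and recall. The short calculation below inverts F1 = 2PR / (P + R) to recover R. The helper name `recall_from_precision_f1` is hypothetical, and the result is derived arithmetic from the two rounded figures above, not a number quoted from this page.

```python
def recall_from_precision_f1(precision: float, f1: float) -> float:
    """Invert F1 = 2 * P * R / (P + R) to solve for recall R."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("F1 must be smaller than 2 * precision")
    return f1 * precision / denom

implied_recall = recall_from_precision_f1(precision=0.904, f1=0.893)
print(f"implied recall: {implied_recall:.3f}")
```

Because both inputs are rounded to three decimals, the implied recall is only approximate (roughly 0.88).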
🔎 Similar Papers
2024-06-07 · International Symposium on High-Performance Computer Architecture · Citations: 5
Yangtao Deng
The Chinese University of Hong Kong
Xiang Shi
ByteDance
Zhuo Jiang
ByteDance
Xingjian Zhang
Tsinghua University
Lei Zhang
ByteDance
Zhang Zhang
ByteDance
Bo Li
ByteDance
Zuquan Song
ByteDance
Hang Zhu
Johns Hopkins University
Computer Systems, Machine Learning Systems, Cloud Computing
Gaohong Liu
ByteDance
Fuliang Li
Northeastern University
Shuguang Wang
ByteDance
Haibin Lin
ByteDance
Machine Learning Systems, Natural Language Processing
Jia-jun Ye
ByteDance
Minlan Yu
Harvard University
Networking, Systems, Cloud Computing