Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
In large language model (LLM) training, collective communication libraries operate as black boxes, which hinders root-cause diagnosis of reliability issues and leads to wasted resources and degraded model performance. To address this, the paper proposes Mycroft, a lightweight, dependency-aware distributed tracing and root cause analysis system that provides runtime observability into collective communication internals. The method traces collective communication states with low overhead and uses the internal control and data dependencies between them to analyze anomalies. Deployed in ByteDance’s production LLM training infrastructure for over six months, it detected anomalies within 15 seconds in 90% of cases and identified the root cause within 20 seconds in 60% of cases. Fault-injection experiments further demonstrate its capability and efficiency.

📝 Abstract
Reliability is essential for ensuring efficiency in LLM training. However, many real-world reliability issues remain difficult to resolve, resulting in wasted resources and degraded model performance. Today's collective communication libraries operate as black boxes, hiding critical information needed for effective root cause analysis. We propose Mycroft, a lightweight distributed tracing and root cause analysis system designed to address previously hidden reliability issues in collective communication. Mycroft's key idea is to trace collective communication states and leverage internal control and data dependencies to resolve reliability problems in LLM training. Mycroft has been deployed at ByteDance for over six months to debug collective-communication-related issues at runtime. It detected anomalies within 15 seconds in 90% of cases and identified the root cause within 20 seconds in 60% of cases. We also conducted extensive fault injection experiments to demonstrate Mycroft's capability and efficiency.

Problem

Research questions and friction points this paper is trying to address.

Tracing dependencies in collective communication for LLM training
Resolving hidden reliability issues in distributed systems
Debugging collective communication anomalies at runtime

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight distributed tracing system
Leverages control and data dependencies
Detects anomalies and root causes quickly
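
To make the idea of using dependencies for root cause analysis concrete, the following is a minimal illustrative sketch, not Mycroft's actual algorithm or data model. It assumes a hypothetical per-rank trace record (`TraceRecord`), a hypothetical "depends on" map between ranks, and a simplified rule that a rank can finish step s only after every rank it depends on has finished step s - 1. Under those assumptions, a stalled rank that is not blocked by any dependency is reported as a suspect. The timeout value is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class TraceRecord:
    """Hypothetical per-rank trace state for one collective operation."""
    rank: int
    steps_done: int      # pipeline steps this rank has completed
    last_update: float   # timestamp (seconds) of the last progress event


def find_stalled(records, now, timeout_s):
    """Return the set of ranks with no progress for longer than timeout_s."""
    return {r.rank for r in records if now - r.last_update > timeout_s}


def find_root_causes(records, depends_on, now, timeout_s):
    """Return stalled ranks that are not blocked by any of their dependencies.

    Assumed rule: a rank can complete step s only after every rank it depends
    on has completed step s - 1, so a rank is blocked by any dependency whose
    steps_done is lower than its own.
    """
    progress = {r.rank: r.steps_done for r in records}
    stalled = find_stalled(records, now, timeout_s)
    suspects = []
    for rank in sorted(stalled):
        deps = depends_on.get(rank, [])
        blocked = any(progress[dep] < progress[rank] for dep in deps)
        if not blocked:
            suspects.append(rank)
    return suspects


# Ring of 4 ranks: rank r receives from rank (r - 1) % 4.
ring = {r: [(r - 1) % 4] for r in range(4)}

# Rank 2 stops first; its downstream ranks each advance one more step
# before they run out of input and stall as well.
records = [
    TraceRecord(rank=0, steps_done=3, last_update=100.0),
    TraceRecord(rank=1, steps_done=4, last_update=101.0),
    TraceRecord(rank=2, steps_done=1, last_update=98.0),
    TraceRecord(rank=3, steps_done=2, last_update=99.0),
]

suspects = find_root_causes(records, ring, now=120.0, timeout_s=10.0)
print(suspects)  # [2]
```

In this toy trace every rank looks stalled, so a timeout alone cannot tell them apart. Only rank 2 has an upstream rank that is ahead of it, which is why the dependency check singles it out.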

Authors

Yangtao Deng, The Chinese University of Hong Kong
Lei Zhang, ByteDance
Qinlong Wang, ByteDance
Xiaoyun Zhi, ByteDance
Xinlei Zhang, ByteDance
Zhuo Jiang, ByteDance
Haohan Xu, ByteDance
Lei Wang, ByteDance
Zuquan Song, ByteDance
Gaohong Liu, ByteDance
Yang Bai, ByteDance
Shuguang Wang, ByteDance
Wencong Xiao, ByteDance (tags: Distributed system, Machine learning system, Resource management)
Jianxi Ye, ByteDance
Minlan Yu, Harvard University (tags: Networking, Systems, Cloud Computing)
Hong Xu, The Chinese University of Hong Kong